%\VignetteIndexEntry{A Guide to the unifDAG Package} %\VignetteEngine{knitr::knitr} %\VignetteKeyword{unifDAG} \documentclass[11pt]{article} \usepackage[english]{babel} \selectlanguage{english} \usepackage{xspace} \usepackage{graphicx} \usepackage{array} \usepackage{fancyhdr} \usepackage{hyperref} \usepackage{cite} \usepackage{url} \usepackage{amsmath,amssymb,amsfonts} \usepackage{multirow} \usepackage{enumitem} \usepackage{caption} \captionsetup[table]{skip=10pt} \usepackage{pifont} \newcommand{\cmark}{\ding{51}}% \newcommand{\xmark}{\ding{55}}% \usepackage{tikz} \tikzstyle{format} = [draw, thin, fill=blue!20] % DOCUMENT \begin{document} <>= UDversion <- packageDescription('unifDAG')$Version @ % title \begin{center} {\LARGE A Guide to the unifDAG Package for R} {\Large (unifDAG version: \Sexpr{UDversion}) } \bigskip {\large Markus Kalisch} \bigskip \today \end{center} % end title \tableofcontents <>= options(digits=7) options(width=64) require('knitr') library(unifDAG) opts_chunk$set( keep.source=FALSE, out.width='4.5in', fig.width=10, fig.height=6, fig.path = 'figure/', cache=FALSE, tidy=TRUE, fig.align='center', size='small') @ \section{Introduction} For some applications, such as obtaining good starting points for Markov Chain methods, it is desirable to sample a DAG uniformly from the space of all (labelled) DAGs given $p$ nodes. While it is trivial to sample labelled \emph{undirected} graphs with $p$ nodes uniformly, sampling DAGs uniformly is much more difficult. Suppose, for example, we want to sample uniformly from the labelled DAGs on two nodes $A$ and $B$. There are three such DAGs: \begin{itemize} \item $G_1: \ A \to B$ \item $G_2: \ A \leftarrow B$ \item $G_3: \ A \quad B\quad$ (i.e., without an edge between $A$ and $B$). \end{itemize} With uniform sampling each of these graphs should be sampled with probability $\frac{1}{3}$, i.e. $P(G_1) = P(G_2) = P(G_3) = \frac{1}{3}$. One naive approach could be first sampling a labelled undirected graph on $p$ nodes uniformly and then sampling uniformly from all DAGs on the sampled undirected graph (i.e. choosing a random permutation of the labels and thus fixing an order). Let us illustrate this using the previous example. On two nodes $A$ and $B$, there are two undirected graphs possible: $A \quad B$ and $A - B$. If we sample $A \quad B$ nothing needs to be oriented an we end up with $G_3$. If we sample $A - B$, we choose a random order of the nodes $A$ and $B$ and thus would obtain $G_1: A \to B$ and $G_2: A \leftarrow B$ each with conditional proability $0.5$. It is easy to see that with this approach $P(G_1)=0.25$, $P(G_2)=0.25$ and $P(G_3)=0.5$. Thus, this approach does not lead to a uniform distribution on the space of DAGs with $p$ nodes but overrepresents sparse DAGs. In particular, the empty DAG with $p$ nodes is much more likely than a particular complete DAG on $p$ nodes: While the empty and the complete undirected graph are equally likely, all permutations of the empty undirected graph will lead to the empty DAG, while every permutation of the complete undirected graph will lead to a different DAG. \section{Enumeration methods} This problem was solved in \cite{kuipers2015uniform} by relating each DAG to a sequence of outpoints (nodes with no incoming edges) and then to a composition of integers. The package \texttt{pcalg} implements this solution in two ways: Function \texttt{unifDAG()} performs the exact procedure using precomputed enumeration tables and is feasible for DAGs with up to 100 nodes. The function \texttt{unifDAG.approx()} uses the exact procedure for DAGs up to a specified number of nodes (option \texttt{n.exact}). For larger numbers of nodes an approximation is used instead. The accuracy of the approximation is based on the option \texttt{n.exact} and will be within the uniformity limits of a 32 (64) bit integer sampler when set to \texttt{n.exact=20} (\texttt{n.exact=40}). Thus, for practical purposes these approximations are indistiguishable from the exact solution but are much faster to compute. Both functions can optionally generate edge weights. In the following example we first sample a DAG (\texttt{dag1}) uniformly from the space of all DAGs with $p=10$ nodes using the exact method. Then, we sample a DAG (\texttt{dag2}) uniformly from the space of all DAGs with $p=150$ nodes using the approximate method. The option \texttt{n.exact=40} is used, so that the sampling procedure will match the exact sampling procedure on a 64-bit integer sampler. In both cases, the edge weights are sampled independently from $Uniform(0, s)$. <>= myWgtFun <- function(m,lB,uB) { runif(m,lB,uB) } set.seed(123) dag1 <- unifDAG(n = 10, weighted = TRUE, wFUN = list(myWgtFun, 0, 2)) dag2 <- unifDAG.approx(n = 150, n.exact = 40, weighted = TRUE, wFUN = list(myWgtFun, 0, 2)) @ Returning to our original example with two nodes, we see that function \texttt{unifDAG()} indeed samples $G_1$, $G_2$ and $G_3$ with roughly equal frequencies. <>= cnt <- c(0,0,0) ## count occurances of G1, G2 and G3 set.seed(123) for (i in 1:100) { g <- unifDAG(n = 2, weighted = FALSE) m <- as(g, "matrix") ## adjacency matrix if ( (m[2,1]==0) & (m[1,2]==0) ) { cnt[3] <- cnt[3] + 1 ## G3 } else if (m[2,1]==0) { cnt[1] <- cnt[1] + 1 ## G1 } else { cnt[2] <- cnt[2] + 1 ## G2 } } cnt @ % BIB \bibliography{mybib_unifDAG_vignette} \bibliographystyle{plainurl} \end{document}