Title: | Modelling Multivariate Data with Additive Bayesian Networks |
Version: | 3.1.9 |
Date: | 2025-06-26 |
Description: | The 'abn' R package facilitates Bayesian network analysis, a probabilistic graphical model that derives from empirical data a directed acyclic graph (DAG). This DAG describes the dependency structure between random variables. The R package 'abn' provides routines to help determine optimal Bayesian network models for a given data set. These models are used to identify statistical dependencies in messy, complex data. Their additive formulation is equivalent to multivariate generalised linear modelling, including mixed models with independent and identically distributed (iid) random effects. The core functionality of the 'abn' package revolves around model selection, also known as structure discovery. It supports both exact and heuristic structure learning algorithms and does not restrict the data distribution of parent-child combinations, providing flexibility in model creation and analysis. The 'abn' package uses Laplace approximations for metric estimation and includes wrappers to the 'INLA' package. It also employs 'JAGS' for data simulation purposes. For more resources and information, visit the 'abn' website. |
License: | GPL (≥ 3) |
URL: | https://r-bayesian-networks.org/, https://github.com/furrer-lab/abn |
BugReports: | https://github.com/furrer-lab/abn/issues |
Depends: | R (≥ 4.0.0) |
Imports: | doParallel, foreach, graph, lme4, mclogit, methods, nnet, Rcpp, Rgraphviz, rjags, stringi |
Suggests: | bookdown, boot, brglm, devtools (≥ 2.4.5), ggplot2, gridExtra, INLA, knitr, Matrix, MatrixModels (≥ 0.5.3), microbenchmark, R.rsp, RhpcBLASctl, rmarkdown, testthat (≥ 3.0.0), entropy, moments, R6 |
LinkingTo: | Rcpp, RcppArmadillo |
VignetteBuilder: | knitr |
Additional_repositories: | https://inla.r-inla-download.org/R/stable/ |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
LazyData: | TRUE |
RoxygenNote: | 7.3.2 |
SystemRequirements: | pkg-config, cmake, gsl, jpeg, gdal, geos, proj, udunits-2, openssl, libcurl, jags |
NeedsCompilation: | yes |
Packaged: | 2025-06-26 05:31:01 UTC; root |
Author: | Matteo Delucchi |
Maintainer: | Matteo Delucchi <matteo.delucchi@math.uzh.ch> |
Repository: | CRAN |
Date/Publication: | 2025-06-26 07:40:02 UTC |
abn
Package
Description
abn
is a collection of functions for fitting, selecting/learning, analyzing, reporting Additive Bayesian Networks.
Value
nothing.
General overview
What is abn: Bayesian network modeling is a data analysis technique that is ideally suited to messy, highly correlated, and complex data sets. This methodology is somewhat distinct from other forms of statistical modeling in that its focus is on structure discovery - determining an optimal graphical model that describes the inter-relationships in the underlying processes which generated the data. It is a multivariate technique and can be used for one or many dependent variables. This is a data-driven approach, as opposed to, rely only on subjective expert opinion to determine how variables of interest are inter-related (for example, structural equation modeling).
The R package abn is designed to fit additive Bayesian models to observational data sets. It contains routines to score Bayesian Networks based on Bayesian or information-theoretic formulation of generalized linear models. It is equipped with exact search and greedy search algorithms to select the best network. The Bayesian implementation supports random effects to control for one layer clustering. It supports a possible mixture of continuous, discrete, and count data and inputs of prior knowledge at a structural level.
The R package abn requires the R package Rgraphviz to work well. It is store outside of CRAN; see ‘Examples’ for the code to install the last version.
The web page r-bayesian-networks.org provides further case studies.
See also the files provided in the package directories inst/bootstrapping_example
and inst/old_vignette
for more details.
Author(s)
Maintainer: Matteo Delucchi matteo.delucchi@math.uzh.ch (ORCID)
Authors:
Reinhard Furrer reinhard.furrer@math.uzh.ch (ORCID)
Gilles Kratzer gilles.kratzer@gmail.com (ORCID)
Fraser Iain Lewis fraser.iain.lewis@gmail.com (ORCID)
Other contributors:
Jonas I. Liechti j-i-l@t4d.ch (ORCID) [contributor]
Marta Pittavino marta.pittavino@math.uzh.ch (ORCID) [contributor]
Kalina Cherneva kalinacherneva@gmail.com [contributor]
References
Kratzer G, Lewis F, Comin A, Pittavino M, Furrer R (2023). “Additive Bayesian Network Modeling with the R Package abn.” Journal of Statistical Software, 105(8), 1-41. https://doi.org/10.18637/jss.v105.i08.
Delucchi M, Liechti J, Spinner G, Furrer R (2024). “abn: Additive Bayesian Networks.” Journal of Open Source Software, Accepted for publication. R package version 3.1.3, https://joss.theoj.org/papers/1bbc43a2be86f5d3f831cedb5cf81812.
Lewis, F. I., and Ward, M. P. (2013). "Improving epidemiologic data analyses through multivariate regression modeling". Emerging themes in epidemiology, 10(1), 4.
Kratzer, G., Lewis, F.I., Willi, B., Meli, M.L., Boretti, F.S., Hofmann-Lehmann, R., Torgerson, P., Furrer, R. and Hartnack, S. (2020). "Bayesian Network Modeling Applied to Feline Calicivirus Infection Among Cats in Switzerland". Frontiers in Veterinary Science, 7, 73.
Delucchi M, Furrer R, Kratzer G, Lewis F, Liechti J, Pittavino M, Cherneva K (2024). "abn: Modelling Multivariate Data with Additive Bayesian Networks". R package version 3.1.3, https://CRAN.R-project.org/package=abn.
See Also
Useful links:
Report bugs at https://github.com/furrer-lab/abn/issues
Examples
## Citations:
print(citation('abn'), bibtex=TRUE)
## Installing the R package Rgraphviz:
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("Rgraphviz")
## README.md in the directory `bootstrapping_example/`:
# edit(file=paste0( path.package('abn'),'/bootstrapping_example/README.md'))
Prints start up message
Description
Prints start up message
Usage
.onAttach(lib, pkg)
Value
Prints startup message to the console
Examples
library(abn)
Print AIC of objects of class abnFit
Description
Print AIC of objects of class abnFit
Usage
## S3 method for class 'abnFit'
AIC(object, digits = 3L, verbose = TRUE, ...)
Arguments
object |
Object of class |
digits |
number of digits of the results. |
verbose |
print additional output. |
... |
additional parameters. Not used at the moment. |
Value
prints the AIC of the fitted model.
Print BIC of objects of class abnFit
Description
Print BIC of objects of class abnFit
Usage
## S3 method for class 'abnFit'
BIC(object, digits = 3L, verbose = TRUE, ...)
Arguments
object |
Object of class |
digits |
number of digits of the results. |
verbose |
print additional output. |
... |
additional parameters. Not used at the moment. |
Value
prints the BIC of the fitted model.
Documentation of C Functions
Description
This is mainly to circumvent issues in R CMD check.
Value
nothing.
Dataset related to Feline calicivirus infection among cats in Switzerland.
Description
The dataset is about the Feline calicivirus (FCV) infection among cats in Switzerland. FCV is a virus that occurs worldwide in domestic cats but also in exotic felids. FCV is a highly contagious virus that is the major cause of upper respiratory disease or cat flue that affects felids. This is a complex disease caused by different viral and bacterial pathogens, i.e., FCV, FHV-1, Mycoplasma felis, Chlamydia felis and Bordetella bronchiseptica. It can be aggravated by retrovirus infections such as FeLV and FIV. This composite dynamic makes it very interesting for a BN modeling approach. The data were collected between September 2012 and April 2013.
Usage
FCV
Format
An adapted data frame of the original dataset, which consists of 300 observations of 15 variables.
- FCV
Feline Calici Virus status (0/1).
- FHV_1
Feline Herpes Virus 1 status (0/1).
- C_felis
C-felis and Chlamydia felis status (0/1).
- M_felis
Mycoplasma felis status (0/1).
- B_bronchiseptica
B-bronchiseptica & Bordetella bronchispetica status (0/1).
- FeLV
feline leukosis virus status (0/1).
- FIV
feline immunodeficiency virus status (0/1).
- Gingivostomatitis
gingivostomatitis complex status (0/1).
- URTD
URTD complex (upper respiratory complex) (0/1).
- Vaccinated
vaccination status (0/1).
- Pedigree
pedigree (0/1).
- Outdoor
outdoor access (0/1).
- Sex
sex and castrated status (M, MN, F, FS).
- GroupSize
number of cats in the group (counts).
- Age
age in year (continuous)\.
References
Berger, A., Willi, B., Meli, M. L., Boretti, F. S., Hartnack, S., Dreyfus, A., ... and Hofmann-Lehmann, R. (2015). Feline calicivirus and other respiratory pathogens in cats with Feline calicivirus-related symptoms and in clinically healthy cats in Switzerland. BMC Veterinary Research, 11(1), 282.
abn Version Information
Description
abn.version()
provides detailed information about the running version of abn
or the abn components.
Usage
abn.version(what = c("abn", "system"))
Arguments
what |
detailed information about the version of abn or the system (see returns). |
Value
abn.version(what = "system")
is a list with character-string components
- R
R.version.string
- abn
essentially
abn.version$version.string
- GSL, JAGS, INLA
version numbers thereof
abn.version(what = "abn")
is a list with character-string components
- status
the status of the version (e.g.,
"beta"
)- major
the major version number
- minor
the minor version number
- year
the year the version was released
- month
the month the version was released
- day
the day the version was released
- version.string
a
character
string concatenating the info above, useful for plotting, etc.
abn.version
is a list of class "simple.list"
which has a print
method.
See Also
Examples
abn.version()$version.string
## Not run:
abn.version("system")
## End(Not run)
Dataset related to average daily growth performance and abattoir findings in pigs commercial production.
Description
The case study dataset is about growth performance and abattoir findings in pigs commercial production in a selected set of 15 Canadian farms collected in March 1987.
Usage
adg
Format
An adapted data frame of the original dataset which consists of 341 observations of 8 variables and a grouping variable (farm).
- AR
presence of atrophic rhinitis.
- pneumS
presence of moderate to severe pneumonia.
- female
sex of the pig (1=female, 0=castrated).
- livdam
presence of liver damage (parasite-induced white spots).
- eggs
presence of fecal/gastrointestinal nematode eggs at time of slaughter.
- wormCount
count of nematodes in small intestine at time of slaughter.
- age
days elapsed from birth to slaughter (days).
- adg
average daily weight gain (grams).
- farm
farm ID.
Details
When using the data to fit an additive Bayesian network,
the variables AR
, pneumS
, female
, livdam
,
eggs
are considered binomial, wormcount
Poisson,
age
and adg
Gaussian.
The variable farm
can be used to adjust for grouping.
References
Kratzer, G., Lewis, F.I., Comin, A., Pittavino, M. and Furrer, R. (2019). "Additive Bayesian Network Modelling with the R Package abn". arXiv preprint arXiv:1911.09006. Dohoo, Ian Robert, Wayne Martin, and Henrik Stryhn. Veterinary epidemiologic research. No. V413 DOHv. Charlottetown, Canada: AVC Incorporated, 2003.
Bugs code for Bernoulli response
Description
Bugs model for a Binomial response X
in a single trial:
X \sim \mathcal{B}(n=1, p) = \mathcal{Bernoulli}(p)
.
Usage
bern_bugs(nodename, nodesintercept, parentnames, parentcoefs)
bern_bugsGroup(nodename, nodesintercept, parentnames, parentcoefs, sigma_alpha)
Arguments
nodename |
character string of response variable name. |
nodesintercept |
overall mean of response. Parameter from fixed-effects intercept. |
parentnames |
single character string (for one parent) or vector of characters (for multiple parent nodes) with parent node (predictor variables) names. |
parentcoefs |
overall slope for each predictor (parent node) variable (fixed-effects). |
sigma_alpha |
between-group variance. Parameter from random-effects intercept. |
Value
Bugs model returned as stdout.
Functions
-
bern_bugsGroup()
: Bugs code for Bernoulli response with varying intercept
See Also
Examples
bern_bugs(nodename = "a",
parentnames = c("b", "c"),
nodesintercept = c(0.318077),
parentcoefs = list("b"=c(b=0.3059395),
"c"=c(c=0.5555)))
Control the iterations in buildScoreCache
Description
Allow the user to set restrictions in the buildScoreCache
for both the Bayesian and the MLE approach.
Control function similar to fit.control
.
Usage
build.control(
method = "bayes",
max.mode.error = 10,
mean = 0,
prec = 0.001,
loggam.shape = 1,
loggam.inv.scale = 5e-05,
max.iters = 100,
epsabs = 1e-07,
error.verbose = FALSE,
trace = 0L,
epsabs.inner = 1e-06,
max.iters.inner = 100,
finite.step.size = 1e-07,
hessian.params = c(1e-04, 0.01),
max.iters.hessian = 10,
max.hessian.error = 0.5,
factor.brent = 100,
maxiters.hessian.brent = 100,
num.intervals.brent = 100,
n.grid = 250,
ncores = 1,
cluster.type = "FORK",
max.irls = 100,
tol = 1e-08,
tolPwrss = 1e-07,
check.rankX = "message+drop.cols",
check.scaleX = "message+rescale",
check.conv.grad = "message",
check.conv.singular = "message",
check.conv.hess = "message",
xtol_abs = 1e-06,
ftol_abs = 1e-06,
trace.mblogit = FALSE,
catcov.mblogit = "free",
epsilon = 1e-06,
seed = 9062019L
)
Arguments
method |
a character that takes one of two values: "bayes" or "mle". Overrides |
max.mode.error |
if the estimated modes from INLA differ by a factor of |
mean |
the prior mean for all the Gaussian additive terms for each node. INLA argument |
prec |
the prior precision ( |
loggam.shape |
the shape parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
loggam.inv.scale |
the inverse scale parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
max.iters |
total number of iterations allowed when estimating the modes in Laplace approximation. passed to |
epsabs |
absolute error when estimating the modes in Laplace approximation for models with no random effects. Passed to |
error.verbose |
logical, additional output in the case of errors occurring in the optimization. Passed to |
trace |
Non-negative integer. If positive, tracing information on the progress of the "L-BFGS-B" optimization is produced. Higher values may produce more tracing information. (There are six levels of tracing. To understand exactly what these do see the source code.). Passed to |
epsabs.inner |
absolute error in the maximization step in the (nested) Laplace approximation for each random effect term. Passed to |
max.iters.inner |
total number of iterations in the maximization step in the nested Laplace approximation. Passed to |
finite.step.size |
suggested step length used in finite difference estimation of the derivatives for the (outer) Laplace approximation when estimating modes. Passed to |
hessian.params |
a numeric vector giving parameters for the adaptive algorithm, which determines the optimal stepsize in the finite-difference estimation of the hessian. First entry is the initial guess, second entry absolute error. Passed to |
max.iters.hessian |
integer, maximum number of iterations to use when determining an optimal finite difference approximation (Nelder-Mead). Passed to |
max.hessian.error |
if the estimated log marginal likelihood when using an adaptive 5pt finite-difference rule for the Hessian differs by more than |
factor.brent |
if using Brent-Dekker root bracketing method then define the outer most interval end points as the best estimate of |
maxiters.hessian.brent |
maximum number of iterations allowed in the Brent-Dekker method. Passed to |
num.intervals.brent |
the number of initial different bracket segments to try in the Brent-Dekker method. Passed to |
n.grid |
recompute density on an equally spaced grid with |
ncores |
The number of cores to parallelize to, see ‘Details’. If >0, the number of CPU cores to be used. -1 for all available -1 core. Only for |
cluster.type |
The type of cluster to be used, see |
max.irls |
total number of iterations for estimating network scores using an Iterative Reweighed Least Square algorithm. Is this DEPRECATED? |
tol |
real number giving the minimal tolerance expected to terminate the Iterative Reweighed Least Square algorithm to estimate network score. Passed to |
tolPwrss |
numeric scalar passed to |
check.rankX |
character passed to |
check.scaleX |
character passed to |
check.conv.grad |
character passed to |
check.conv.singular |
character passed to |
check.conv.hess |
character passed to |
xtol_abs |
Defaults to 1e-6 stop on small change of parameter value. Only for |
ftol_abs |
Defaults to 1e-6 stop on small change in deviance. Similar to |
trace.mblogit |
logical indicating if output should be produced for each iteration. Directly passed to |
catcov.mblogit |
Defaults to "free" meaning that there are no restrictions on the covariances of random effects between the logit equations. Set to "diagonal" if random effects pertinent to different categories are uncorrelated or "single" if random effect variances pertinent to all categories are identical. |
epsilon |
Defaults to 1e-8. Positive convergence tolerance |
seed |
a non-negative integer which sets the seed in |
Details
Parallelization over all children is possible via the function foreach
of the package doParallel. ncores=0
or ncores=1
use single threaded foreach
. ncores=-1
uses all available cores but one.
Value
Named list according the provided arguments.
See Also
Other buildScoreCache:
buildScoreCache()
Examples
ctrlmle <- abn::build.control(method = "mle",
ncores = 0,
cluster.type = "PSOCK",
max.irls = 100,
tol = 10^-11,
tolPwrss = 1e-7,
check.rankX = "message+drop.cols",
check.scaleX = "message+rescale",
check.conv.grad = "message",
check.conv.singular = "message",
check.conv.hess = "message",
xtol_abs = 1e-6,
ftol_abs = 1e-6,
trace.mblogit = FALSE,
catcov.mblogit = "free",
epsilon = 1e-6,
seed = 9062019L)
ctrlbayes <- abn::build.control(method = "bayes",
max.mode.error = 10,
mean = 0, prec = 0.001,
loggam.shape = 1,
loggam.inv.scale = 5e-05,
max.iters = 100,
epsabs = 1e-07,
error.verbose = FALSE,
epsabs.inner = 1e-06,
max.iters.inner = 100,
finite.step.size = 1e-07,
hessian.params = c(1e-04, 0.01),
max.iters.hessian = 10,
max.hessian.error = 0.5,
factor.brent = 100,
maxiters.hessian.brent = 100,
num.intervals.brent = 100,
tol = 10^-8,
seed = 9062019L)
Build a cache of goodness of fit metrics for each node in a DAG, possibly subject to user-defined restrictions
Description
Iterates over all valid parent combinations - subject to ban, retain, and max.parent
limits - for each node, or a subset of nodes, and computes a cache of scores (AIC, BIC, log marginal likelihood).
This cache can then be used in different DAG structural search algorithms.
Usage
buildScoreCache(data.df = NULL,
data.dists = NULL,
method = "bayes",
group.var = NULL,
adj.vars = NULL,
cor.vars = NULL,
dag.banned = NULL,
dag.retained = NULL,
max.parents = NULL,
which.nodes = NULL,
defn.res = NULL,
centre = TRUE,
dry.run = FALSE,
control = NULL,
verbose = FALSE,
debugging = FALSE,
...)
buildScoreCache.bayes(
data.df = NULL,
data.dists = NULL,
group.var = NULL,
cor.vars = NULL,
dag.banned = NULL,
dag.retained = NULL,
max.parents = NULL,
which.nodes = NULL,
defn.res = NULL,
dry.run = FALSE,
centre = TRUE,
force.method = NULL,
mylist = NULL,
grouped.vars = NULL,
group.ids = NULL,
control = build.control(method = "bayes"),
verbose = FALSE,
debugging = FALSE
)
forLoopContentBayes(
row.no = NULL,
children = NULL,
node.defn = NULL,
dag.m = NULL,
force.method = NULL,
data.df = NULL,
data.dists = NULL,
var.types = NULL,
control = NULL,
grouped.vars = NULL,
group.ids = NULL,
verbose = FALSE
)
forLoopContent(
row.num,
mycache,
data.dists,
data.df.multi,
adj.vars,
data.df,
data.df.lvl,
group.var,
group.ids,
control,
n,
verbose
)
buildScoreCache.mle(
data.df = NULL,
data.dists = NULL,
max.parents = NULL,
adj.vars = NULL,
cor.vars = NULL,
dag.banned = NULL,
dag.retained = NULL,
which.nodes = NULL,
centre = TRUE,
defn.res = NULL,
dry.run = FALSE,
verbose = FALSE,
debugging = FALSE,
force.method = NULL,
group.var = NULL,
grouped.vars = NULL,
group.ids = NULL,
control = build.control(method = "mle")
)
Arguments
data.df |
a data frame containing the data used for learning each node. Binary variables must be declared as factors. |
data.dists |
a named list giving the distribution for each node in the network, see ‘Details’. |
method |
should a "Bayes" or "mle" approach be used, see ‘Details’. |
group.var |
variable name for nodes to be fitted as variable intercept as in a mixed-effects model ("Bayes" and "mle") and gives the column name in |
adj.vars |
a character vector giving the column names in |
cor.vars |
a character vector giving the column names in |
dag.banned |
a matrix or a formula statement (see ‘Details’ for format) defining which arcs are not permitted - banned - see ‘Details’ for format. Note that colnames and rownames must be set, otherwise same row/column names as data.df will be assumed. If set as NULL an empty matrix is assumed. |
dag.retained |
a matrix or a formula statement (see ‘Details’ for format) defining which arcs are must be retained in any model search, see ‘Details’ for format. Note that colnames and rownames must be set, otherwise same row/column names as data.df will be assumed. If set as NULL an empty matrix is assumed. |
max.parents |
a constant or named list giving the maximum number of parents allowed, the list version allows this to vary per node (only for |
which.nodes |
a vector giving the column indices of the variables to be included, if ignored all variables are included. This is used to subset |
defn.res |
an optional user-supplied list of child and parent combinations, see ‘Details’. |
centre |
should the observations in each Gaussian node first be standardized to mean zero and standard deviation one, defaults to TRUE. |
dry.run |
if TRUE then a list of the child nodes and parent combinations are returned but without estimation of node scores (log marginal likelihoods). |
control |
a list of control parameters. See |
verbose |
if TRUE then provides some additional output. |
debugging |
if |
... |
additional arguments passed for optimization. |
force.method |
"notset", "INLA" or "C". This is specified in |
mylist |
result returned from |
grouped.vars |
result returned from |
group.ids |
result returned from |
row.no |
The row number of the child-parent combination to be processed. |
children |
vector of child node integers. |
node.defn |
child-parent combination table. |
dag.m |
Empty adjacency matrix. |
var.types |
vector of numeric encoding of distribution types. See |
row.num |
number of child-node (mostly corresponds to child node index e.g. in dag). |
mycache |
prepared cache. |
data.df.multi |
extended data.df for one-hot-encoded multinomial variables. |
data.df.lvl |
copy of original |
n |
corresponds to |
Details
The function computes a cache of scores based on possible restrictions (maximum complexity, retained and banned arcs).
This function is very similar to fitAbn
- see that help page for details of the type of models used and in particular data.dists
specification - but rather than fit a single complete DAG buildScoreCache
iterates over all different parent combinations for each node, creating a cache of scores.
This cache of score could be used to select the optimal network in other function such as searchHeuristic
or mostProbable
.
‘dag.banned’ and ‘dag.retained’ specify which arcs are forced to be absent or present in the DAG, respectively.
If provided as matrix, rows represent child nodes and columns their parents for elements with a value $=1$.
Two very different approaches are implemented: a Bayesian and frequentist approaches. They can be selected using the method
argument.
If method="bayes"
:
This function is used to calculate all individual node scores (log marginal likelihoods).
Internal code is used by default for numerical estimation in nodes without random effects, and INLA is the default for nodes with random effects.
This default behavior can be overridden using control=list(max.mode.error=...)
. The default is max.mode.error=10
, which means that the modes estimated from INLA output must be within 10\
Otherwise, the internal code is used rather than INLA.
To force the use of INLA on all nodes, use max.mode.error=100, which then ignores this check, to force the use of internal code then use max.mode.error=0
. For more detials, see fitAbn.
The variable which.nodes
is to allow the computation to be separated by node, for example, over different CPUs using say R CMD BATCH
.
This may useful and indeed likely essential with larger problems or those with random effects.
Note that in this case, the results must then be combined back into a list of identical formats to that produced by an individual call to buildScoreCache
,
comprising of all nodes (in the same order as the columns in data.df
) before sending it to any search routines. Using dry.run
can be useful here.
If method="mle"
:
This function is used to calculate all individual information-theoretic node scores. The possible information-theoretic based network scores computed in buildScoreCache
are the maximum likelihood (mlik, called marginal likelihood in this context as it is computed node wise),
the Akaike Information Criteria (aic), the Bayesian Information Criteria (bic) and the Minimum distance Length (mdl). The classical definitions of those metrics are given in Kratzer and Furrer (2018). This function computes a cache that can be fed into a model search algorithm.
The numerical routines used here are identical to those in fitAbn
and see that help page for further details and also the quality assurance section on the r-bayesian-networks.org of the abn website for more details.
Value
A named list of class abnCache
.
children
a vector of the child node indexes (from 1) corresponding to the columns in data.df (ignoring any grouping variable)
node.defn
a matrix giving the parent combination
mlik
log marginal likelihood value for each node combination. If the model cannot be fitted then NA is returned.
error.code
if non-zero then either the root finding algorithm (glm nodes) or the maximisation algorithm (glmm nodes) terminated in an unusual way suggesting a possible unreliable result, or else the finite difference hessian estimation produced and error or warning (glmm nodes). NULL if
method="mle"
.error.code.desc
a textual description of the
error.code
. NULL ifmethod="mle"
hessian.accuracy
An estimate of the error in the final mlik value for each parent combination - this is the absolute difference between two different adaptive finite difference rules where each computes the mlik value. NULL if
method="mle"
data.df
a version of the original data (for internal use only in other functions such as
mostProbable
).data.dists
the named list of nodes distributions (for internal use only in other functions such as
mostProbable
).max.parents
the maximum number of parents (for internal use only in other functions such as
mostProbable
).dag.retained
the matrix encoding the retained arcs (for internal use only in other functions such as
searchHeuristic
).dag.banned
the matrix encoding the banned arcs (for internal use only in other functions such as
searchHeuristic
).aic
aic value for each node combination. If the model cannot be fitted then NaN is returned. NULL if
method="bayes"
.bic
bic value for each node combination. If the model cannot be fitted then NaN is returned. NULL if
method="bayes"
.mdl
mdl value for each node combination. If the model cannot be fitted then NaN is returned. NULL if
method="bayes"
.
Named vector of results from one child-parent combination subject to the row.no
.
The names are:
- childParentCombNo
The row number of the child-parent combination in the
node.defn
table. This must be the same as the row number innode.defn
: careful ifbuildScoreCache.bayes()
is run in parallel!- mlik
The marginal log-likelihood of the child-parent combination.
- error.code
The error code returned by
inla()
.- hessian.accuracy
The accuracy of the Hessian matrix returned by
inla()
.- used.INLA
A logical value indicating whether
inla()
was used to fit the model.
A list that will be passed to buildScoreCache.mle
.
Functions
-
buildScoreCache.bayes()
: Fit a given DAG to data with method="bayes". -
forLoopContentBayes()
: Internal function called bybuildScoreCache.bayes()
. -
forLoopContent()
: Internal function called bybuildScoreCache.mle()
. -
buildScoreCache.mle()
: Fit a given DAG to data with method="mle".
References
Kratzer, Gilles, Fraser Lewis, Arianna Comin, Marta Pittavino, and Reinhard Furrer. “Additive Bayesian Network Modeling with the R Package Abn.” Journal of Statistical Software 105 (January 28, 2023): 1–41. https://doi.org/10.18637/jss.v105.i08.
Kratzer, G., Lewis, F.I., Comin, A., Pittavino, M., and Furrer, R. (2019). "Additive Bayesian Network Modelling with the R Package abn". arXiv:1911.09006.
Kratzer, G., and Furrer, R., (2018). "Information-Theoretic Scoring Rules to Learn Additive Bayesian Network Applied to Epidemiology". arXiv:1808.01126.
Lewis, F. I., and McCormick, B. J. J. (2012). "Revealing the complexity of health determinants in resource poor settings". American Journal Of Epidemiology. doi:10.1093/aje/KWS183).
Further information about abn can be found at: r-bayesian-networks.org.
See Also
Other buildScoreCache:
build.control()
Other Bayes:
calc.node.inla.glm()
,
calc.node.inla.glmm()
,
fitAbn()
,
getmarginals()
Examples
## Simple example
# Generate data
N <- 1e6
mydists <- list(a="gaussian",
b="gaussian",
c="gaussian")
a <- rnorm(n = N, mean = 0, sd = 1)
b <- 1 + 2*rnorm(n = N, mean = 5, sd = 1)
c <- 2 + 1*a + 2*b + rnorm(n = N, mean = 2, sd = 1)
mydf <- data.frame("a" = scale(a),
"b" = scale(b),
"c" = scale(c))
# ABN with MLE
mycache.mle <- buildScoreCache(data.df = mydf,
data.dists = mydists,
method = "mle",
max.parents = 2)
dag.mle <- mostProbable(score.cache = mycache.mle,
max.parents = 2)
myfit.mle <- fitAbn(object = dag.mle,
method = "mle",
max.parents = 2)
plot(myfit.mle)
## Not run:
# ABN with Bayes
if(requireNamespace("INLA", quietly = TRUE)){
# Run only if INLA is available
mycache.bayes <- buildScoreCache(data.df = mydf,
data.dists = mydists,
method = "bayes",
max.parents = 2)
dag.bayes <- mostProbable(score.cache = mycache.bayes,
max.parents = 2)
myfit.bayes <- fitAbn(object = dag.bayes,
method = "bayes",
max.parents = 2)
plot(myfit.bayes)
}
# Compare MLE and Bayes with lm
mymod.lm <- lm(c ~ a + b, data = mydf)
summary(mymod.lm)
##################################################################################################
## Example 1 - "mle" vs. "bayes" and the later with using the internal C routine compared to INLA
##################################################################################################
# Subset of the build-in dataset, see ?ex0.dag.data
mydat <- ex0.dag.data[,c("b1","b2","g1","g2","b3","g3")] ## take a subset of cols
# setup distribution list for each node
mydists <- list(b1="binomial", b2="binomial", g1="gaussian",
g2="gaussian", b3="binomial", g3="gaussian")
# Structural constraints
## ban arc from b2 to b1
## always retain arc from g2 to g1
## parent limits - can be specified for each node separately
max.par <- list("b1"=2, "b2"=2, "g1"=2, "g2"=2, "b3"=2, "g3"=2)
# now build the cache of pre-computed scores accordingly to the structural constraints
res.c <- buildScoreCache(data.df=mydat,
data.dists=mydists,
dag.banned= ~b1|b2,
dag.retained= ~g1|g2,
max.parents=max.par,
method="bayes")
# repeat but using R-INLA. The mlik's should be virtually identical.
# Force using of INLA build.control(max.mode.error=100)
if(requireNamespace("INLA", quietly = TRUE)){
res.inla <- buildScoreCache(data.df=mydat,
data.dists=mydists,
dag.banned= ~b1|b2, # ban arc from b2 to b1
dag.retained= ~g1|g2, # always retain arc from g2 to g1
max.parents=max.par,
method="bayes",
control=build.control(max.mode.error=100))
## comparison - very similar
difference <- res.c$mlik - res.inla$mlik
summary(difference)
}
# Comparison Bayes with MLE (unconstrained):
res.mle <- buildScoreCache(data.df=mydat, data.dists=mydists,
max.parents=3, method="mle")
res.abn <- buildScoreCache(data.df=mydat, data.dists=mydists,
max.parents=3, method="bayes")
# of course different, but same order:
plot(-res.mle$bic, res.abn$mlik)
#################################################################
## Example 2 - mle with several cores
#################################################################
## Many variables, few observations
mydat <- ex0.dag.data
mydists <- as.list(rep(c("binomial", "gaussian", "poisson"), each=10))
names(mydists) <- names(mydat)
system.time({
res.mle1 <- buildScoreCache(data.df=mydat,
data.dists=mydists,
max.parents=2,
method="mle",
control = build.control(method = "mle",
ncores=1))})
system.time({
res.mle2 <- buildScoreCache(data.df=mydat,
data.dists=mydists,
max.parents=2,
method="mle",
control = build.control(method = "mle",
ncores=2))})
#################################################################
## Example 3 - grouped data - random effects example e.g. glmm
#################################################################
## this data comes with abn see ?ex3.dag.data
mydat <- ex3.dag.data[,c("b1","b2","b3","b4","b5","b6","b7",
"b8","b9","b10","b11","b12","b13", "group")]
mydists <- list(b1="binomial", b2="binomial", b3="binomial",
b4="binomial", b5="binomial", b6="binomial", b7="binomial",
b8="binomial", b9="binomial", b10="binomial",b11="binomial",
b12="binomial", b13="binomial" )
max.par <- 2
## in this example INLA is used as default since these are glmm nodes
## when running this at node-parent combination 71 the default accuracy check on the
## INLA modes is exceeded (default is a max. of 10 percent difference from
## modes estimated using internal code) and a message is given that internal code
## will be used in place of INLA's results.
mycache.bayes <- buildScoreCache(data.df=mydat,
data.dists=mydists,
group.var="group",
method = "bayes",
max.parents=max.par)
dag.bayes <- mostProbable(score.cache=mycache.bayes)
plot(dag.bayes)
mycache.mle <- buildScoreCache(data.df=mydat,
data.dists=mydists,
group.var="group",
method = "mle",
max.parents=max.par)
dag.mle <- mostProbable(score.cache=mycache.mle)
plot(dag.mle)
## End(Not run)
Fit a given regression using INLA
Description
Internal wrapper to INLA and are called from fitAbn.bayes
and buildScoreCache.bayes
.
Usage
calc.node.inla.glm(
child.loc = NULL,
dag.m.loc = NULL,
data.df.loc = NULL,
data.dists.loc = NULL,
ntrials.loc = NULL,
exposure.loc = NULL,
compute.fixed.loc = NULL,
mean.intercept.loc = NULL,
prec.intercept.loc = NULL,
mean.loc = NULL,
prec.loc = NULL,
loggam.shape.loc = NULL,
loggam.inv.scale.loc = NULL,
verbose.loc = FALSE,
nthreads = NULL
)
Arguments
child.loc |
index of current child node. |
dag.m.loc |
dag as matrix. |
data.df.loc |
data df, |
data.dists.loc |
list of distributions. |
ntrials.loc |
|
exposure.loc |
|
compute.fixed.loc |
TRUE. |
mean.intercept.loc |
the prior mean for all the Gaussian additive terms for each node. INLA argument |
prec.intercept.loc |
the prior precision for all the Gaussian additive term for each node. INLA argument |
mean.loc |
the prior mean for all the Gaussian additive terms for each node. INLA argument |
prec.loc |
the prior precision for all the Gaussian additive term for each node. INLA argument |
loggam.shape.loc |
the shape parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
loggam.inv.scale.loc |
the inverse scale parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
verbose.loc |
FALSE to not print additional output. |
nthreads |
number of threads to use for INLA. Default is |
Value
If INLA failed, FALSE or an error is returned. Otherwise, the direct output from INLA is returned.
See Also
Other Bayes:
buildScoreCache()
,
calc.node.inla.glmm()
,
fitAbn()
,
getmarginals()
Fit a given regression using INLA
Description
Internal wrapper to INLA and are called from fitAbn.bayes
and buildScoreCache.bayes
.
Usage
calc.node.inla.glmm(
child.loc = NULL,
dag.m.loc = NULL,
data.df.loc = NULL,
data.dists.loc = NULL,
ntrials.loc = NULL,
exposure.loc = NULL,
compute.fixed.loc = NULL,
mean.intercept.loc = NULL,
prec.intercept.loc = NULL,
mean.loc = NULL,
prec.loc = NULL,
loggam.shape.loc = NULL,
loggam.inv.scale.loc = NULL,
verbose.loc = FALSE,
nthreads = NULL
)
Arguments
child.loc |
index of current child node. |
dag.m.loc |
dag as matrix. |
data.df.loc |
data df, |
data.dists.loc |
list of distributions. |
ntrials.loc |
|
exposure.loc |
|
compute.fixed.loc |
TRUE. |
mean.intercept.loc |
the prior mean for all the Gaussian additive terms for each node. INLA argument |
prec.intercept.loc |
the prior precision for all the Gaussian additive term for each node. INLA argument |
mean.loc |
the prior mean for all the Gaussian additive terms for each node. INLA argument |
prec.loc |
the prior precision for all the Gaussian additive term for each node. INLA argument |
loggam.shape.loc |
the shape parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
loggam.inv.scale.loc |
the inverse scale parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
verbose.loc |
FALSE to not print additional output. |
nthreads |
number of threads to use for INLA. Default is |
Value
If INLA failed, FALSE or an error is returned. Otherwise, the direct output from INLA is returned.
See Also
Other Bayes:
buildScoreCache()
,
calc.node.inla.glm()
,
fitAbn()
,
getmarginals()
Bugs code for Categorical response
Description
Bugs code for Categorical response
Usage
categorical_bugs(
nodename,
nodesCatIdx,
parentnames,
nodesintercepts,
parentcoefs
)
categorical_bugsGroup(
nodename,
nodesCatIdx,
nodesintercepts,
parentnames,
parentcoefs,
sigma,
sigma_alpha
)
Arguments
nodename |
character string of response variable name. |
nodesCatIdx |
integer vector of length |
parentnames |
single character string (for one parent) or vector of characters (for multiple parent nodes) with parent node (predictor variables) names. |
nodesintercepts |
overall mean of response. Parameter from fixed-effects intercept. |
parentcoefs |
overall slope for each predictor (parent node) variable (fixed-effects). |
sigma |
within-group variance. Parameter from random-effects residual. |
sigma_alpha |
between-group variance-covariance matrix. Parameters from random-effects intercept. |
Details
The output of fitAbn
with method = "mle"
is based on
the output of logistic regression models fit with either lm
,
glm
, glmer
, multinom
,
mblogit
or internal irls
methods.
They all use the first factor level as reference level.
Therefore, nodesCatIdx
starts with index 2
and not 1
.
nodesintercepts
and parentcoefs
refer to the values of
(Intercept)
and Estimate
of the respective model output.
Predictor names build the keys in parentcoef
.
Value
Bugs model returned as stdout.
Functions
-
categorical_bugsGroup()
: Bugs code for Categorical response with varying intercept
See Also
Examples
# A -> B
# Where B is a categorical variable with 4 levels.
categorical_bugs(nodename = "b",
nodesCatIdx = c(2, 3, 4),
parentnames = "a",
nodesintercepts = c(2.188650, 3.133928, 3.138531),
parentcoefs = list("a"=c(a=1.686432, a=3.134161, a=5.052104)))
Simple check on the control parameters
Description
Simple check on the control parameters
Usage
check.valid.buildControls(control, method = "bayes", verbose = FALSE)
Arguments
control |
list of control arguments with new parameters supplied to |
method |
"bayes" or "mle" strategy from argument |
verbose |
when TRUE additional information is printed. Defaults to FALSE. |
Value
list with all control arguments with respect to the method but with new values.
Set of simple commonsense validity checks on the directed acyclic graph definition matrix
Description
Set of simple commonsense validity checks on the directed acyclic graph definition matrix
Usage
check.valid.dag(
dag = NULL,
data.df = NULL,
is.ban.matrix = FALSE,
group.var = NULL
)
Arguments
dag |
Named square matrix or a formula statement specifying a directed acyclic graph. If NULL an empty network is assumed. |
data.df |
data frame with names corresponding to variable/node names. |
is.ban.matrix |
Diagonals and cycles are not checked for ban-matrices. Defaults to FALSE. |
group.var |
not yet implemented |
Value
dag as named square matrix
Set of simple commonsense validity checks on the data.df and data.dists arguments
Description
Set of simple commonsense validity checks on the data.df and data.dists arguments
Usage
check.valid.data(data.df = NULL, data.dists = NULL, group.var = NULL)
Arguments
data.df |
data frame with names corresponding to variable/node names. |
data.dists |
list specifying each columns distribution type. Names correspond to column names and values must be one of "gaussian", "binomial", "poisson", "multinomial". |
group.var |
not yet implemented |
Value
list of indexes for each distribution type
a list of the indexes for each distribution type
Simple check on the control parameters
Description
Simple check on the control parameters
Usage
check.valid.fitControls(control, method = "bayes", verbose = FALSE)
Arguments
control |
list of control arguments with new parameters supplied to |
method |
"bayes" or "mle" strategy from argument |
verbose |
when TRUE additional information is printed. Defaults to FALSE. |
Value
list with all control arguments with respect to the method but with new values.
Simple check on the grouping variable
Description
Simple check on the grouping variable
Usage
check.valid.groups(
group.var = NULL,
data.df = NULL,
cor.vars = NULL,
verbose = FALSE
)
Arguments
group.var |
Name of grouping variable. |
data.df |
data frame of all variables including the variable specified in |
cor.vars |
Name(s) of variables to which the grouping should be applied to. |
verbose |
when TRUE additional information is printed. Defaults to FALSE. |
Value
list with data.df, indexes of variables to which the grouping should be applied to and the associated group for each observation as integer.
Set of simple checks on the given parent limits
Description
Set of simple checks on the given parent limits
Usage
check.valid.parents(data.df = NULL, max.parents = NULL, group.var = NULL)
Arguments
data.df |
data frame |
max.parents |
single integer for one overall max parent limit.
A list with names corresponding to variable/column names of |
group.var |
not yet implemented |
Value
numeric vector of max number of parents per variable
Set of simple checks on the list given as parent limits
Description
Set of simple checks on the list given as parent limits
Usage
check.which.valid.nodes(data.df = NULL, which.nodes = NULL, group.var = NULL)
Arguments
data.df |
data frame |
which.nodes |
a vector giving the column indices of the variables to be included, if ignored all variables are included. |
group.var |
not yet implemented |
Value
integer vector of column indexes of valid nodes in data.df
Print coefficients of objects of class abnFit
Description
Print coefficients of objects of class abnFit
Usage
## S3 method for class 'abnFit'
coef(object, digits = 3L, verbose = TRUE, ...)
Arguments
object |
Object of class |
digits |
number of digits of the results. |
verbose |
print additional output. |
... |
additional parameters. Not used at the moment. |
Value
prints the coefficients of the fitted model.
Compare two DAGs or EGs
Description
Function that returns multiple graph metrics to compare two DAGs or essential graphs, known as confusion matrix or error matrix.
Usage
compareDag(ref, test, node.names = NULL, checkDAG = TRUE)
Arguments
ref |
a matrix or a formula statement (see details for format) defining
the reference network structure, a directed acyclic graph (DAG).
Note that row names must be set or given in |
test |
a matrix or a formula statement (see details for format) defining
the test network structure, a directed acyclic graph (DAG).
Note that row names must be set or given in |
node.names |
a vector of names if the DAGs are given via formula, see details. |
checkDAG |
should the DAGs be tested for DAGs (default). |
Details
This R function returns standard Directed Acyclic Graph comparison metrics. In statistical classification, those metrics are known as a confusion matrix or error matrix.
Those metrics allows visualization of the difference between different DAGs. In the case where comparing TRUTH to learned structure or two learned structures, those metrics allow the user to estimate the performance of the learning algorithm. In order to compute the metrics, a contingency table is computed of a pondered difference of the adjacency matrices od the two graphs.
The ref
or test
can be provided using a formula statement
(similar to GLM input).
A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The formula statement have to start with ~
.
In this example, node1 has two parents (parent1 and parent2).
node2 and node3 have the same parent3.
The parents names have to exactly match those given in node.names
.
:
is the separtor between either children or parents,
|
separates children (left side) and parents (right side),
+
separates terms, .
replaces all the variables in node.names
.
To test for essential graphs (or graphs) in general, the test for DAG
need to be switched off checkDAG=FALSE
.
The function compareEG()
is a wrapper to compareDag(, checkDAG=FALSE)
.
Value
TP
True Positive
TN
True Negative
FP
False Positive
FN
False Negative
CP
Condition Positive (ref)
CN
Condition Negative (ref)
PCP
Predicted Condition Positive (test)
PCN
Predicted Condition Negative (test)
True Positive Rate
=\frac{\sum TP}{\sum CP}
False Positive Rate
=\frac{\sum FP}{\sum CN}
Accuracy
=\frac{\sum TP + \sum TN}{Total population}
G-measure
\sqrt {{\frac {TP}{TP+FP}}\cdot {\frac {TP}{TP+FN}}}
F1-Score
\frac{2 \sum TP}{2 \sum TP + \sum FN + \sum FP}
Positive Predictive Value
\frac{\sum TP}{\sum PCP}
False Ommision Rate
\frac{\sum FN}{\sum PCN}
Hamming-Distance
Number of changes needed to match the matrices.
References
Sammut, Claude, and Geoffrey I. Webb. (2017). Encyclopedia of machine learning and data mining. Springer.
Examples
test.m <- matrix(data = c(0,1,0,
0,0,0,
1,0,0), nrow = 3, ncol = 3)
ref.m <- matrix(data = c(0,0,0,
1,0,0,
1,0,0), nrow = 3, ncol = 3)
colnames(test.m) <- rownames(test.m) <- colnames(ref.m) <- rownames(ref.m) <- c("a", "b", "c")
unlist(compareDag(ref = ref.m, test = test.m))
Compare two DAGs or EGs
Description
Function that returns multiple graph metrics to compare two DAGs or essential graphs, known as confusion matrix or error matrix.
Usage
compareEG(ref, test)
Arguments
ref |
a matrix or a formula statement (see details for format) defining
the reference network structure, a directed acyclic graph (DAG).
Note that row names must be set or given in |
test |
a matrix or a formula statement (see details for format) defining
the test network structure, a directed acyclic graph (DAG).
Note that row names must be set or given in |
Details
This R function returns standard Directed Acyclic Graph comparison metrics. In statistical classification, those metrics are known as a confusion matrix or error matrix.
Those metrics allows visualization of the difference between different DAGs. In the case where comparing TRUTH to learned structure or two learned structures, those metrics allow the user to estimate the performance of the learning algorithm. In order to compute the metrics, a contingency table is computed of a pondered difference of the adjacency matrices od the two graphs.
The ref
or test
can be provided using a formula statement
(similar to GLM input).
A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The formula statement have to start with ~
.
In this example, node1 has two parents (parent1 and parent2).
node2 and node3 have the same parent3.
The parents names have to exactly match those given in node.names
.
:
is the separtor between either children or parents,
|
separates children (left side) and parents (right side),
+
separates terms, .
replaces all the variables in node.names
.
To test for essential graphs (or graphs) in general, the test for DAG
need to be switched off checkDAG=FALSE
.
The function compareEG()
is a wrapper to compareDag(, checkDAG=FALSE)
.
Value
TP
True Positive
TN
True Negative
FP
False Positive
FN
False Negative
CP
Condition Positive (ref)
CN
Condition Negative (ref)
PCP
Predicted Condition Positive (test)
PCN
Predicted Condition Negative (test)
True Positive Rate
=\frac{\sum TP}{\sum CP}
False Positive Rate
=\frac{\sum FP}{\sum CN}
Accuracy
=\frac{\sum TP + \sum TN}{Total population}
G-measure
\sqrt {{\frac {TP}{TP+FP}}\cdot {\frac {TP}{TP+FN}}}
F1-Score
\frac{2 \sum TP}{2 \sum TP + \sum FN + \sum FP}
Positive Predictive Value
\frac{\sum TP}{\sum PCP}
False Ommision Rate
\frac{\sum FN}{\sum PCN}
Hamming-Distance
Number of changes needed to match the matrices.
References
Sammut, Claude, and Geoffrey I. Webb. (2017). Encyclopedia of machine learning and data mining. Springer.
Examples
test.m <- matrix(data = c(0,1,0,
0,0,0,
1,0,0), nrow = 3, ncol = 3)
ref.m <- matrix(data = c(0,0,0,
1,0,0,
1,0,0), nrow = 3, ncol = 3)
colnames(test.m) <- rownames(test.m) <- colnames(ref.m) <- rownames(ref.m) <- c("a", "b", "c")
unlist(compareDag(ref = ref.m, test = test.m))
Make DAG of class "abnDag"
Description
Make DAG of class "abnDag"
Usage
createAbnDag( dag, data.df = NULL, data.dists = NULL, ...)
Arguments
dag |
a matrix or a formula specifying a DAG, see ‘Details’. |
data.df |
named dataframe. |
data.dists |
named list giving the distribution for each node in the network. If not provided it will be sample and returned. |
... |
further arguments passed to or from other methods. |
Details
An object of class class(abnDag)
contains a named matrix describing the DAG and possibly additional objects such as the associated distributions of the nodes.
If the dag is specified with a formula, either data.df
or data.dists
is required with the .
quantifier.
If the dag is specified with an unnamed matrix and both data.df
and data.dists
are missing, lower-case letters of the Roman alphabet are used for the node names.
Value
abnDag object as list of dag, data.df, data.dists.
Create a legitimate DAGs
Create a legitimate DAG in the abn format.
An object of class abnDag
containing a named matrix and a named list giving the distribution for each node.
Examples
dagFromFormula <- createAbnDag(dag = ~a+b|a,
data.df = data.frame("a"=1, "b"=1),
data.dists = list(a="binomial", b="gaussian"))
dagFromMatrix <- createAbnDag(dag = matrix(c(0,1,0,0), 2, 2),
data.df = data.frame("a"=1, "b"=1),
data.dists = list(a="binomial", b="gaussian"))
plot(dagFromMatrix)
Discretization of a Possibly Continuous Data Frame of Random Variables based on their distribution
Description
This function discretizes a data frame of possibly continuous random variables through rules for discretization. The discretization algorithms are unsupervised and univariate. See details for the complete list of discretization rules (the number of state of each random variable could also be provided).
Usage
discretization(data.df = NULL,
data.dists = NULL,
discretization.method = "sturges",
nb.states = FALSE)
Arguments
data.df |
a data frame containing the data to discretize, binary and multinomial variables must be declared as factors, others as a numeric vector. The data frame must be named. |
data.dists |
a named list giving the distribution for each node in the network. |
discretization.method |
a character vector giving the discretization method to use; see details. If a number is provided, the variable will be discretized by equal binning. |
nb.states |
logical variable to select the output. If set to |
Details
fd
Freedman Diaconis rule. IQR()
stands for interquartile range.
The number of bins is given by
\frac{range(x) * n^{1/3}}{2 * IQR(x)}
The Freedman Diaconis rule is known to be less sensitive than the Scott's rule to outlier.
doane
Doane's rule.
The number of bins is given by
1 + \log_{2}{n} + \log_{2}{1+\frac{|g|}{\sigma_{g}}}
This is a modification of Sturges' formula, which attempts to improve its performance with non-normal data.
sqrt
The number of bins is given by:
\sqrt(n)
cencov
Cencov's rule.
The number of bins is given by:
n^{1/3}
rice
Rice' rule.
The number of bins is given by:
2 n^{1/3}
terrell-scott
Terrell-Scott's rule.
The number of bins is given by:
(2 n)^{1/3}
It is known that Cencov, Rice, and Terrell-Scott rules over-estimates k, compared to other rules due to its simplicity.
sturges
Sturges's rule.
The number of bins is given by:
1 + \log_2(n)
scott
Scott's rule.
The number of bins is given by:
range(x) / \sigma(x) n^{-1/3}
Value
The discretized data frame or a list containing the table of counts for each bin the discretized data frame.
table of counts for each bin of the discretized data frame.
References
Garcia, S., et al. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25.4, 734-750.
Cebeci, Z. and Yildiz, F. (2017). Unsupervised Discretization of Continuous Variables in a Chicken Egg Quality Traits Dataset. Turkish Journal of Agriculture-Food Science and Technology, 5.4, 315-320.
Examples
## Generate random variable
rv <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list("gaussian")
names(dist) <- c("rv")
## Compute the entropy through discretization
entropyData(freqs.table = discretization(data.df = rv, data.dists = dist,
discretization.method = "sturges", nb.states = FALSE))
Computes an Empirical Estimation of the Entropy from a Table of Counts
Description
This function empirically estimates the Shannon entropy from a table of counts using the observed frequencies.
Usage
entropyData(freqs.table)
Arguments
freqs.table |
a table of counts. |
Details
The general concept of entropy is defined for probability distributions.
The entropyData()
function estimates empirical entropy from data.
The probability is estimated from data using frequency tables.
Then the estimates are plug-in in the definition of the entropy to return
the so-called empirical entropy. A common known problem of empirical entropy
is that the estimations are biased due to the sampling noise.
It is also known that the bias will decrease as the sample size increases.
Value
Shannon's entropy estimate on natural logarithm scale.
integer
References
Cover, Thomas M, and Joy A Thomas. (2012). "Elements of Information Theory". John Wiley & Sons.
See Also
Examples
## Generate random variable
rv <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list("gaussian")
names(dist) <- c("rv")
## Compute the entropy through discretization
entropyData(freqs.table = discretization(data.df = rv, data.dists = dist,
discretization.method = "sturges", nb.states = FALSE))
Construct the essential graph
Description
Constructs different versions of the essential graph from a given DAG. External function that computes essential graph of a dag Minimal PDAG: The only directed edges are those who participate in v-structure Completed PDAG: very directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge
Usage
essentialGraph(dag, node.names = NULL, PDAG = "minimal")
Arguments
dag |
a matrix or a formula statement (see ‘Details’ for format) defining the network structure, a directed acyclic graph (DAG). |
node.names |
a vector of names if the DAG is given via formula, see ‘Details’. |
PDAG |
a character value that can be: minimal or complete, see ‘Details’. |
Details
This function returns an essential graph from a DAG, aka acyclic partially directed graph (PDAG). This can be useful if the learning procedure is defined up to a Markov class of equivalence. A minimal PDAG is defined as only directed edges are those who participate in v-structure. Whereas the completed PDAG: every directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge.
The dag
can be provided using a formula statement (similar to glm).
A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The formula statement have to start with ~
.
In this example, node1 has two parents (parent1 and parent2).
node2 and node3 have the same parent3.
The parents names have to exactly match those given in node.names
.
:
is the separator between either children or parents,
|
separates children (left side) and parents (right side),
+
separates terms, .
replaces all the variables in node.names
.
Value
A matrix giving the PDAG.
References
West, D. B. (2001). Introduction to Graph Theory. Vol. 2. Upper Saddle River: Prentice Hall. Chickering, D. M. (2013) A Transformational Characterization of Equivalent Bayesian Network Structures, arXiv:1302.4938.
Examples
dag <- matrix(c(0,0,0, 1,0,0, 1,1,0), nrow = 3, ncol = 3)
dist <- list(a="gaussian", b="gaussian", c="gaussian")
colnames(dag) <- rownames(dag) <- names(dist)
essentialGraph(dag)
function to get marginal across an equal grid
Description
function to get marginal across an equal grid
Usage
eval.across.grid(mylist, n.grid, single)
Arguments
mylist |
list of matrices of two cols x, y |
n.grid |
grid size |
single |
NULL or TRUE if only a single node and parameter |
Value
list
Synthetic validation data set for use with abn library examples
Description
300 observations simulated from a DAG with 10 binary variables, 10 Gaussian variables and 10 poisson variables.
Usage
ex0.dag.data
Format
A data frame, binary variables are factors. The relevant formulas are given below (note these do not give parameter estimates just the form of the relationships, e.g. logit()=1 means a logit link function and comprises of only an intercept term).
- b1
binary, logit()=1
- b2
binary, logit()=1
- b3
binary, logit()=1
- b4
binary, logit()=1
- b5
binary, logit()=1
- b6
binary, logit()=1
- b7
binary, logit()=1
- b8
binary, logit()=1
- b9
binary, logit()=1
- b10
binary, logit()=1
- g1
gaussian, identity()=1
- g2
gaussian, identity()=1
- g3
gaussian, identity()=1
- g4
gaussian, identity()=1
- g5
gaussian, identity()=1
- g6
gaussian, identity()=1
- g7
gaussian, identity()=1
- g8
gaussian, identity()=1
- g9
gaussian, identity()=1
- g10
gaussian, identity()=1
- p1
poisson, log()=1
- p2
poisson, log()=1
- p3
poisson, log()=1
- p4
poisson, log()=1
- p5
poisson, log()=1
- p6
poisson, log()=1
- p7
poisson, log()=1
- p8
poisson, log()=1
- p9
poisson, log()=1
- p10
poisson, log()=1
Examples
## Not run:
## The dataset was (essentially) generated using the following code:
datasize <- 300
tmp <- c(rep("y", as.integer(datasize/2)), rep("n", as.integer(datasize/2)))
set.seed(1)
ex0.dag.data <- data.frame(b1=sample(tmp, size=datasize, replace=TRUE),
b2=sample(tmp, size=datasize, replace=TRUE),
b3=sample(tmp, size=datasize, replace=TRUE),
b4=sample(tmp, size=datasize, replace=TRUE),
b5=sample(tmp, size=datasize, replace=TRUE),
b6=sample(tmp, size=datasize, replace=TRUE),
b7=sample(tmp, size=datasize, replace=TRUE),
b8=sample(tmp, size=datasize, replace=TRUE),
b9=sample(tmp, size=datasize, replace=TRUE),
b10=sample(tmp, size=datasize, replace=TRUE),
g1=rnorm(datasize, mean=0,sd=1),
g2=rnorm(datasize, mean=0,sd=1),
g3=rnorm(datasize, mean=0,sd=1),
g4=rnorm(datasize, mean=0,sd=1),
g5=rnorm(datasize, mean=0,sd=1),
g6=rnorm(datasize, mean=0,sd=1),
g7=rnorm(datasize, mean=0,sd=1),
g8=rnorm(datasize, mean=0,sd=1),
g9=rnorm(datasize, mean=0,sd=1),
g10=rnorm(datasize, mean=0,sd=1),
p1=rpois(datasize, lambda=10),
p2=rpois(datasize, lambda=10),
p3=rpois(datasize, lambda=10),
p4=rpois(datasize, lambda=10),
p5=rpois(datasize, lambda=10),
p6=rpois(datasize, lambda=10),
p7=rpois(datasize, lambda=10),
p8=rpois(datasize, lambda=10),
p9=rpois(datasize, lambda=10),
p10=rpois(datasize, lambda=10))
## End(Not run)
Synthetic validation data set for use with abn library examples
Description
10000 observations simulated from a DAG with 10 variables from Poisson, Bernoulli and Gaussian distributions.
Usage
ex1.dag.data
Format
A data frame, binary variables are factors. The relevant formulas are given below (note these do not give parameter estimates just the form of the relationships, like in glm(), e.g. logit()=1+p1 means a logit link function and comprises of an intercept term and a term involving p1).
- b1
binary, logit()=1
- p1
poisson, log()=1
- g1
gaussian, identity()=1
- b2
binary, logit()=1
- p2
poisson, log()=1+b1+p1
- b3
binary, logit()=1+b1+g1+b2
- g2
gaussian, identify()=1+p1+g1+b2
- b4
binary, logit()=1+g1+p2
- b5
binary, logit()=1+g1+g2
- g3
gaussian, identity()=1+g1+b2
Examples
## The data is one realisation from the the underlying DAG:
ex1.true.dag <- matrix(data=c(
0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,
1,1,0,0,0,0,0,0,0,0,
1,0,1,1,0,0,0,0,0,0,
0,1,1,1,0,0,0,0,0,0,
0,0,1,0,1,0,0,0,0,0,
0,0,1,0,0,0,1,0,0,0,
0,0,1,1,0,0,0,0,0,0), ncol=10, byrow=TRUE)
colnames(ex1.true.dag) <- rownames(ex1.true.dag) <-
c("b1","p1","g1","b2","p2","b3","g2","b4","b5","g3")
Synthetic validation data set for use with abn library examples
Description
10000 observations simulated from a DAG with 18 variables three sets each from Poisson, Bernoulli and Gaussian distributions.
Usage
ex2.dag.data
Format
A data frame, binary variables are factors. The relevant formulas are given below (note these do not give parameter estimates just the form of the relationships, e.g. logit()=1 means a logit link function and comprises of only an intercept term).
- b1
binary,logit()=1+g1+b2+b3+p3+b4+g4+b5
- g1
gaussian,identity()=1
- p1
poisson,log()=1+g6
- b2
binary,logit()=1+p3+b4+p6
- g2
gaussian,identify()=1+b2
- p2
poisson,log()=1+b2
- b3
binary,logit()=1+g1+g2+p2+g3+p3+g4
- g3
gaussian,identify()=1+g1+p3+b4
- p3
poisson,log()=1
- b4
binary,logit()=1+g1+p3+p5
- g4
gaussian,identify()=1+b4;
- p4
poisson,log()=1+g1+b2+g2+b5
- b5
binary,logit()=1+b2+g2+b3+p3+g4
- g5
gaussian,identify()=1
- p5
poisson,log()=1+g1+g5+b6+g6
- b6
binary,logit()=1
- g6
gaussian,identify()=1
- p6
poisson,log()=1+g5
Examples
## The true underlying stochastic model has DAG - this data is a single realisation.
ex2.true.dag <- matrix(data = c(
0,1,0,1,0,0,1,0,1,1,1,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,
0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,1,0,0,1,1,0,1,1,0,1,0,0,0,0,0,0,0,
0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,1,1,0,1,0,1,0,1,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
), ncol = 18, byrow = TRUE)
colnames(ex2.true.dag) <- rownames(ex2.true.dag) <- c("b1","g1","p1","b2",
"g2","p2","b3","g3",
"p3","b4","g4","p4",
"b5","g5","p5","b6",
"g6","p6")
Validation data set for use with abn library examples
Description
1000 observations across with 13 binary variables and one grouping variable. Real (anonymised) data of unknown structure.
Usage
ex3.dag.data
Format
A data frame with 14 columns, where
b1,b2,...,b13
are binary variables encoded as factors and
group
is a factor with 100 factors defining the sampling
groups (10 observations each).
Valdiation data set for use with abn library examples
Description
2000 observations across with 10 binary variables and one grouping variable. Real (anonymised) data of unknown structure.
Usage
ex4.dag.data
Format
A data frame with eleven columns:
group
factor with 85 levels defining sampling groups;
b1,...,b10
binary variables encoded as factors.
Valdiation data set for use with abn library examples
Description
434 observations across with 18 variables, 6 binary and 12 continuous, and one grouping variable. Real (anonymised) data of unknown structure.
Usage
ex5.dag.data
Format
A data frame with 19 columns: b1,...,b6
binary
variables, encoded as factors;
g1,...,g12
continuous variables. Finally, the column
group
defines sampling groups (encoded as a factor as well).
Valdiation data set for use with abn library examples
Description
800 observations across with 8 variables, 1 count, 2 binary and 4 continuous, and 1 grouping variable. Real (anonymised) data of unknown structure.
Usage
ex6.dag.data
Format
A data frame with eight columns. Binary variables are factors
- p1
count
- g1
continuous
- g2
continuous
- b1
binary
- b2
binary
- g3
continuous
- g4
continuous
- group
factor,defines sampling groups
Valdiation data set for use with abn library examples
Description
10648 observations across with 3 variables, 2 binary and 1 grouping variable. Real (anonymised) data of unknown structure.
Usage
ex7.dag.data
Format
A data frame, binary variables are factors
- b1
binary
- b2
binary
- group
factor, defines sampling groups
expit of proportions
Description
See also the C implementation ?abn::expit_cpp()
.
Usage
expit(x)
Arguments
x |
numeric with values between |
Value
numeric vector of same length as x
.
expit function
Description
transform x
either via the logit, or expit.
Usage
expit_cpp(x)
Arguments
x |
a numeric vector |
Value
a numeric vector
Factorial
Description
Calculate the factorial in C##
Usage
factorial(n)
Value
a double
Fast Factorial
Description
Calculate the factorial in C##
Usage
factorial_fast(n)
Value
a double
Print family of objects of class abnFit
Description
Print family of objects of class abnFit
Usage
## S3 method for class 'abnFit'
family(object, ...)
Arguments
object |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
prints the distributions for each variable of the fitted model.
Find next X evaluation Point
Description
Attempt to find the next x evaluation point using spline extrapolation traversing left from mode.
Usage
find.next.left.x(mat.xy, g.max, g.factor, x.delta, max.fact.delta)
find.next.right.x(mat.xy, g.max, g.factor, x.delta, max.fact.delta)
Arguments
mat.xy |
matrix |
g.max |
integer. See Details. |
g.factor |
integer. See Details. |
x.delta |
integer. See Details. |
max.fact.delta |
integer. See Details. |
Details
if new x point is more than a factor max.fact.delta (e.g. 0.2) from last evaluated point then stop here
cat("evaluating node ",nodeid,": parameter:",paramid," at betafixed=",betafixed," with gvalue=",gvalue,"\n",sep="");
find the next x value left which differs from the max. gvalue by at least a factor of g.factor, searching in step sizes of
x.delta subject to the constraint that if we move more than max.fact.delta*last.x then we evaluate here. Avoids big steps.
Value
integer
integer
Functions
-
find.next.right.x()
: Attempt to find the next x evaluation point using spline extrapolation traversing right from mode.
Control the iterations in fitAbn
Description
Allow the user to set restrictions in the fitAbn
for both the Bayesian and the MLE approach.
Control function similar to build.control
.
Usage
fit.control(
method = "bayes",
max.mode.error = 10,
mean = 0,
prec = 0.001,
loggam.shape = 1,
loggam.inv.scale = 5e-05,
max.iters = 100,
epsabs = 1e-07,
error.verbose = FALSE,
trace = 0L,
epsabs.inner = 1e-06,
max.iters.inner = 100,
finite.step.size = 1e-07,
hessian.params = c(1e-04, 0.01),
max.iters.hessian = 10,
max.hessian.error = 1e-04,
factor.brent = 100,
maxiters.hessian.brent = 10,
num.intervals.brent = 100,
min.pdf = 0.001,
n.grid = 250,
std.area = TRUE,
marginal.quantiles = c(0.025, 0.25, 0.5, 0.75, 0.975),
max.grid.iter = 1000,
marginal.node = NULL,
marginal.param = NULL,
variate.vec = NULL,
ncores = 1,
cluster.type = "FORK",
max.irls = 100,
tol = 1e-11,
tolPwrss = 1e-07,
check.rankX = "message+drop.cols",
check.scaleX = "message+rescale",
check.conv.grad = "message",
check.conv.singular = "message",
check.conv.hess = "message",
xtol_abs = 1e-06,
ftol_abs = 1e-06,
trace.mblogit = FALSE,
catcov.mblogit = "free",
epsilon = 1e-06,
seed = 9062019L
)
Arguments
method |
a character that takes one of two values: "bayes" or "mle". Overrides |
max.mode.error |
if the estimated modes from INLA differ by a factor of |
mean |
the prior mean for all the Gaussian additive terms for each node. INLA argument |
prec |
the prior precision ( |
loggam.shape |
the shape parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
loggam.inv.scale |
the inverse scale parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
max.iters |
total number of iterations allowed when estimating the modes in Laplace approximation. passed to |
epsabs |
absolute error when estimating the modes in Laplace approximation for models with no random effects. Passed to |
error.verbose |
logical, additional output in the case of errors occurring in the optimization. Passed to |
trace |
Non-negative integer. If positive, tracing information on the progress of the "L-BFGS-B" optimization is produced. Higher values may produce more tracing information. (There are six levels of tracing. To understand exactly what these do see the source code.). Passed to |
epsabs.inner |
absolute error in the maximization step in the (nested) Laplace approximation for each random effect term. Passed to |
max.iters.inner |
total number of iterations in the maximization step in the nested Laplace approximation. Passed to |
finite.step.size |
suggested step length used in finite difference estimation of the derivatives for the (outer) Laplace approximation when estimating modes. Passed to |
hessian.params |
a numeric vector giving parameters for the adaptive algorithm, which determines the optimal stepsize in the finite-difference estimation of the hessian. First entry is the initial guess, second entry absolute error. Passed to |
max.iters.hessian |
integer, maximum number of iterations to use when determining an optimal finite difference approximation (Nelder-Mead). Passed to |
max.hessian.error |
if the estimated log marginal likelihood when using an adaptive 5pt finite-difference rule for the Hessian differs by more than |
factor.brent |
if using Brent-Dekker root bracketing method then define the outer most interval end points as the best estimate of |
maxiters.hessian.brent |
maximum number of iterations allowed in the Brent-Dekker method. Passed to |
num.intervals.brent |
the number of initial different bracket segments to try in the Brent-Dekker method. Passed to |
min.pdf |
the value of the posterior density function below which we stop the estimation only used when computing marginals, see details. |
n.grid |
recompute density on an equally spaced grid with |
std.area |
logical, should the area under the estimated posterior density be standardized to exactly one, useful for error checking. |
marginal.quantiles |
vector giving quantiles at which to compute the posterior marginal distribution at. |
max.grid.iter |
gives number of grid points to estimate posterior density at when not explicitly specifying a grid used to avoid excessively long computation. |
marginal.node |
used in conjunction with |
marginal.param |
used in conjunction with |
variate.vec |
a vector containing the places to evaluate the posterior marginal density, must be supplied if |
ncores |
The number of cores to parallelize to, see ‘Details’. If >0, the number of CPU cores to be used. -1 for all available -1 core. Only for |
cluster.type |
The type of cluster to be used, see |
max.irls |
total number of iterations for estimating network scores using an Iterative Reweighed Least Square algorithm. Is this DEPRECATED? |
tol |
real number giving the minimal tolerance expected to terminate the Iterative Reweighed Least Square algorithm to estimate network score. Passed to |
tolPwrss |
numeric scalar passed to |
check.rankX |
character passed to |
check.scaleX |
character passed to |
check.conv.grad |
character passed to |
check.conv.singular |
character passed to |
check.conv.hess |
character passed to |
xtol_abs |
Defaults to 1e-6 stop on small change of parameter value. Only for |
ftol_abs |
Defaults to 1e-6 stop on small change in deviance. Similar to |
trace.mblogit |
logical indicating if output should be produced for each iteration. Directly passed to |
catcov.mblogit |
Defaults to "free" meaning that there are no restrictions on the covariances of random effects between the logit equations. Set to "diagonal" if random effects pertinent to different categories are uncorrelated or "single" if random effect variances pertinent to all categories are identical. |
epsilon |
Defaults to 1e-8. Positive convergence tolerance |
seed |
a non-negative integer which sets the seed in |
Details
Parallelization over all children is possible via the function foreach
of the package doParallel. ncores=0
or ncores=1
use single threaded foreach
. ncores=-1
uses all available cores but one.
Value
a list of control parameters for the fitAbn
function.
See Also
Other fitAbn:
fitAbn()
Examples
ctrlmle <- abn::fit.control(method = "mle",
max.irls = 100,
tol = 10^-11,
tolPwrss = 1e-7,
xtol_abs = 1e-6,
ftol_abs = 1e-6,
epsilon = 1e-6,
ncores = 2,
cluster.type = "PSOCK",
seed = 9062019L)
ctrlbayes <- abn::fit.control(method = "bayes",
mean = 0,
prec = 0.001,
loggam.shape = 1,
loggam.inv.scale = 5e-05,
max.mode.error = 10,
max.iters = 100,
epsabs = 1e-07,
error.verbose = FALSE,
epsabs.inner = 1e-06,
max.iters.inner = 100,
finite.step.size = 1e-07,
hessian.params = c(1e-04, 0.01),
max.iters.hessian = 10,
max.hessian.error = 1e-04,
factor.brent = 100,
maxiters.hessian.brent = 10,
num.intervals.brent = 100,
min.pdf = 0.001,
n.grid = 100,
std.area = TRUE,
marginal.quantiles = c(0.025, 0.25, 0.5, 0.75, 0.975),
max.grid.iter = 1000,
marginal.node = NULL,
marginal.param = NULL,
variate.vec = NULL,
ncores = 1,
cluster.type = NULL,
seed = 9062019L)
Fit an additive Bayesian network model
Description
Fits an additive Bayesian network to observed data and is equivalent to Bayesian or information-theoretic multi-dimensional regression modeling. Two numerical options are available in the Bayesian settings, standard Laplace approximation or else an integrated nested Laplace approximation provided via a call to the R INLA library (see r-inla.org - this is not hosted on CRAN).
Usage
fitAbn(object = NULL,
dag = NULL,
data.df = NULL,
data.dists = NULL,
method = NULL,
group.var = NULL,
adj.vars = NULL,
cor.vars = NULL,
centre = TRUE,
compute.fixed = FALSE,
control = NULL,
verbose = FALSE,
debugging = FALSE,
...)
fitAbn.bayes(
dag = NULL,
data.df = NULL,
data.dists = NULL,
group.var = NULL,
cor.vars = NULL,
centre = TRUE,
compute.fixed = FALSE,
control = fit.control(method = "bayes"),
mylist = NULL,
grouped.vars = NULL,
group.ids = NULL,
force.method = NULL,
verbose = FALSE,
debugging = FALSE
)
fitAbn.mle(
dag = NULL,
data.df = NULL,
data.dists = NULL,
group.var = NULL,
grouped.vars = NULL,
group.ids = NULL,
adj.vars = NULL,
cor.vars = NULL,
centre = TRUE,
control = fit.control(method = "mle"),
verbose = FALSE,
debugging = FALSE
)
regressionLoop(
i = NULL,
dag = NULL,
data.df = NULL,
data.df.multi = NULL,
data.dists = NULL,
group.var = NULL,
grouped.vars = NULL,
group.ids = NULL,
control = NULL,
nvars = NULL,
nobs = NULL,
dag.multi = NULL,
verbose = NULL
)
Arguments
object |
an object of class |
dag |
a matrix or a formula statement (see details) defining the network structure, a directed acyclic graph (DAG), see details for format. Note that column names and row names must be set up. |
data.df |
a data frame containing the data used for learning the network, binary variables must be declared as factors, and no missing values all allowed in any variable. |
data.dists |
a named list giving the distribution for each node in the network, see details. |
method |
if |
group.var |
only applicable for mixed models and gives the column name in |
adj.vars |
a character vector giving the column names in |
cor.vars |
a character vector giving the column names in data.df for which a mixed model should be used ( |
centre |
should the observations in each Gaussian node first be standardised to mean zero and standard deviation one. |
compute.fixed |
a logical flag, set to |
control |
a list of control parameters. See |
verbose |
if |
debugging |
if |
... |
additional arguments passed for optimization. |
mylist |
result returned from |
grouped.vars |
result returned from |
group.ids |
result returned from |
force.method |
"notset", "INLA" or "C". This is specified in |
i |
number of child-node (mostly corresponds to child node index e.g. in dag). |
data.df.multi |
extended data.df for one-hot-encoded multinomial variables. |
nvars |
number of variables in data.dists. |
nobs |
number of observations in data.df. |
dag.multi |
extended dag for one-hot-encoded multinomial variables. |
Details
If method="Bayes"
:
The procedure fitAbn
fits an additive Bayesian network model to data where each node (variable - a column in data.df) can be either: presence/absence (Bernoulli); continuous (Gaussian); or an unbounded count (Poisson). Multinomial distributions are only supported with method = "mle"
(see below).
The model comprises of a set of conditionally independent generalized linear regressions with or without random effects.
Internal code is used by default for numerical estimation in nodes without random effects, and INLA is the default for nodes with random effects.
This default behavior can be overridden using control=list(max.mode.error=...)
. The default is max.mode.error=10
, which means that the modes estimated from INLA output must be within 10\
Otherwise, the internal code is used rather than INLA.
To force the use of INLA on all nodes, use max.mode.error=100
, which then ignores this check, to force the use of internal code then use max.mode.error=0
.
For the numerical reliability and perform of abn see https://r-bayesian-networks.org/.
Generally speaking, INLA can be swift and accurate, but in several cases, it can perform very poorly and so some care is required (which is why there is an internal check on the modes).
Binary variables must be declared as factors with two levels, and the argument data.dists
must be a list with named arguments, one for each of the variables in data.df
(except a grouping variable - if present!), where each entry is either "poisson","binomial", "multinomial" or "gaussian", see examples below.
The "poisson" and "binomial" distributions use log and logit link functions, respectively.
Note that "binomial" here actually means only binary, one Bernoulli trial per row in data.df
.
If the data are grouped into correlated blocks - wherein a standard regression context a mixed model might be used - then a network comprising of one or more nodes where a generalized linear mixed model is used (but limited to only a single random effect).
This is achieved by specifying parameters group.var
and cor.vars
.
Where the former defines the group membership variable, which should be a factor indicating which observations belong to the same grouping.
The parameter cor.vars
is a character vector that contains the names of the nodes for which a mixed model should be used. This is not yet implemented with method = 'mle'
.
For example, in some problems, it may be appropriate for all variables (except group.var
) in data.df to be parametrized as a mixed model while in others it may only be a single variable for which grouping adjustment is required (as the remainder of variables are covariates measured at group level).
In the network structure definition, dag
, each row represents a node in the network, and the columns in each row define the parents for that particular node, see the example below for the specific format.
The dag
can be provided using a formula statement (similar to GLM).
A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The formula statement have to start with ~
. In this example, node1 has two parents (parent1 and parent2). node2 and node3 have the same parent3.
The parents names must match those given in data.df
. :
is the separator between either children or parents, |
separates children (left side) and parents (right side), +
separates terms, .
replaces all the variables in data.df
.
If compute.fixed=TRUE
then the marginal posterior distributions for all parameters are computed.
Note the current algorithm used to determine the evaluation grid is rather crude and may need to be manually refined using variate.vec
(one parameter at a time) for publication-quality density estimates.
Note that a manual grid can only be used with internal code and not INLA (which uses its own grid).
The end points are defined as where the value of the marginal density drops below a given threshold pdf.min
.
When estimating the log marginal likelihood in models with random effects (using internal code rather than INLA), an attempt is made to minimize the error by comparing the estimates given between a 3pt and 5pt rule when estimating the Hessian in the Laplace approximation.
The modes used in each case are identical. The first derivatives are computed using gsl's adaptive finite difference function, and this is embedding inside the standard 3pt and 5pt rules for the second derivatives.
In all cases, a central difference approximation is tried first with a forward difference being a fall back (as the precision parameters are strictly positive).
The error is minimized through choosing an optimal step size using gsl's Nelder-Mead optimization, and if this fails, (e.g., is larger than max.hessian.error
) then the Brent-Dekker root bracketing method is used as a fallback.
If the error cannot be reduced to below max.hessian.error
, then the step size, which gave the lowest error during the searches (across potentially many different initial bracket choices), is used for the final Hessian evaluations in the Laplace approximation.
If method="mle"
:
The procedure fitAbn
with the argument method= "mle"
fits an additive Bayesian network model to data where each node (variable - a column in data.df) can be either: presence/absence (Bernoulli); continuous (Gaussian); an unbounded count (Poisson); or a discrete variable (Multinomial).
The model comprises of a set of conditionally independent generalized linear regressions with or without adjustment.
Binary and discrete variables must be declared as factors and the argument data.dists
must be a list with named arguments, one for each of the variables in data.df
, where each entry is either "poisson","binomial", "multinomial" or "gaussian", see examples below.
The "poisson" and "binomial" distributions use log and logit link functions, respectively.
Note that "binomial" here actually means only binary, one Bernoulli trial per row in data.df.
If the data are grouped into correlated blocks - wherein a standard regression context a mixed-effect model might be used - then a network comprising of one or more nodes where a generalized linear mixed model is used (but limited to only a single random intercept).
This is achieved by specifying parameter group.var
(cor.vars
as with method = "bayes"
is not yet implemented with method = "mle"
).
The parameter group.var
defines the group membership variable, which should be a factor indicating which observations belong to the same grouping.
This corresponds to "1|group.var"
in the formula notation of e.g. lme4.
In the context of fitAbn
adjustment means that irrespective to the adjacency matrix the adjustment variable set (adj.vars
) will be add as covariate to every node defined by cor.vars
.
In the network structure definition, dag
, each row represents a node in the network, and the columns in each row define the parents for that particular node, see the example below for the specific format.
The dag
can be provided using a formula statement (similar to GLM). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The formula statement have to start with ~
. In this example, node1 has two parents (parent1 and parent2). node2 and node3 have the same parent3.
The parents names have to exactly match those given in data.df
. :
is the separator between either children or parents, |
separates children (left side) and parents (right side), +
separates terms, .
replaces all the variables in data.df
.
The Information-theoretic based network scores used in fitAbn
with argument method="mle"
are the maximum likelihood (mlik, called marginal likelihood in this context as it is computed node wise), the Akaike Information Criteria (aic), the Bayesian Information Criteria (bic) and the Minimum distance Length (mdl). The classical definitions of those metrics are given in Kratzer and Furrer (2018).
The numerical routine is based on an iterative scheme to estimate the regression coefficients. The Iterative Reweighed Least Square (IRLS) programmed using Rcpp/RcppArmadrillo. One hard coded feature of fitAbn
with argument method="mle"
is a conditional use of a bias reduced binomial regression when a classical Generalized Linear Model (GLM) fails to estimate the maximum likelihood of the given model accurately. Additionally, a QR decomposition is performed to check for rank deficiency. If the model is rank deficient and the BR GLM fails to estimate it, then predictors are sequentially removed. This feature aims at better estimating network scores when data sparsity is present.
A special care should be taken when interpreting or even displaying p-values computed with fitAbn
. Indeed, the full model is already selected using goodness of fit metrics based on the (same) full dataset.
The control
argument is a list with separate arguments for the Bayesian and MLE implementation. See fit.control
for details.
Value
An object of class abnFit
. A named list. One entry for each of the variables in data.df
(excluding the grouping variable, if present) which contains an estimate of the log marginal likelihood for that individual node. An entry "mlik" which is the total log marginal likelihood for the full ABN model. A vector of error.codes
- non-zero if a numerical error or warning occurred, and a vector error.code.desc giving a text description of the error. A list modes
, which contains all the mode estimates for each parameter at each node. A vector called Hessian accuracy, which is the estimated local error in the log marginal likelihood for each node. If compute.fixed=TRUE
then a list entry called marginals
which contains a named entry for every parameter in the ABN and each entry in this list is a two-column matrix where the first column is the value of the marginal parameter, say x, and the second column is the respective density value, pdf(x). Also, a list called marginal.quantiles
is produced, giving the quantiles for each marginal posterior distribution.
list
Functions
-
fitAbn.bayes()
: Internal function called byfitAbn
. -
fitAbn.mle()
: Internal function called byfitAbn
. -
regressionLoop()
: Internal function called byfitAbn.mle()
.
Author(s)
Fraser Iain Lewis and Gilles Kratzer.
References
Kratzer, G., Lewis, F.I., Comin, A., Pittavino, M. and Furrer, R. (2019). "Additive Bayesian Network Modelling with the R Package abn". arXiv preprint arXiv:1911.09006.
Kratzer, G., and Furrer, R., 2018. Information-Theoretic Scoring Rules to Learn Additive Bayesian Network Applied to Epidemiology. Preprint; Arxiv: stat.ML/1808.01126.
Lewis, F. I., and McCormick, B. J. J. (2012). Revealing the complexity of health determinants in resource poor settings. American Journal Of Epidemiology. DOI:10.1093/aje/KWS183.
Further information about abn can be found at: r-bayesian-networks.org.
See Also
Other fitAbn:
fit.control()
Other Bayes:
buildScoreCache()
,
calc.node.inla.glm()
,
calc.node.inla.glmm()
,
getmarginals()
Examples
## Not run:
## Built-in dataset with a subset of cols
mydat <- ex0.dag.data[, c("b1", "b2", "b3", "g1", "b4", "p2", "p4")]
## setup distribution list for each node
mydists <- list(b1 = "binomial",
b2 = "binomial",
b3 = "binomial",
g1 = "gaussian",
b4 = "binomial",
p2 = "poisson",
p4 = "poisson")
## Null model - all independent variables
mydag_empty <- matrix(0, nrow = 7, ncol = 7)
colnames(mydag_empty) <- rownames(mydag_empty) <- names(mydat)
## Now fit the model to calculate its goodness-of-fit
myres <- fitAbn(dag = mydag_empty,
data.df = mydat,
data.dists = mydists)
## Log-marginal likelihood goodness-of-fit for complete DAG
print(myres$mlik)
## fitAbn accepts also the formula statement
myres <- fitAbn(dag = ~ b1 | b2 + b2 | p4:g1 + g1 | p2 + b3 | g1 + b4 | b1 + p4 | g1,
data.df = mydat,
data.dists = mydists)
print(myres$mlik) # a much weaker fit than full independence DAG
# Plot the DAG via Rgraphviz
plot(myres)
## Or equivalently using the formula statement, with plotting
## Now repeat but include some dependencies first
mydag <- mydag_empty
mydag["b1", "b2"] <- 1 # b1<-b2 and so on
mydag["b2", "p4"] <- mydag["b2", "g1"] <- mydag["g1", "p2"] <- 1
mydag["b3", "g1"] <- mydag["b4", "b1"] <- mydag["p4", "g1"] <- 1
myres_alt <- fitAbn(dag = mydag,
data.df = mydat,
data.dists = mydists)
plot(myres_alt)
## -----------------------------------------------------------------------------
## This function contains an MLE implementation accessible through a method
## parameter use built-in simulated data set
## -----------------------------------------------------------------------------
myres_mle <- fitAbn(dag = ~ b1 | b2 + b2 | p4 + g1 + g1 | p2 + b3 | g1 + b4 | b1 + p4 | g1,
data.df = mydat,
data.dists = mydists,
method = "mle")
## Print the output for mle first then for Bayes:
print(myres_mle)
plot(myres_mle)
print(myres)
plot(myres)
## This is a basic plot of some posterior densities. The algorithm used for
## selecting density points is quite straightforward, but it might result
## in a sparse distribution. Therefore, we also recompute the density over
## an evenly spaced grid of 50 points between the two endpoints that had
## a minimum PDF at f = min.pdf.
## Setting max.mode.error = 0 forces the use of the internal C code.
myres_c <- fitAbn(dag = mydag,
data.df = mydat,
data.dists = mydists,
compute.fixed = TRUE,
control = list(max.mode.error = 0))
print(names(myres_c$marginals)) # gives all the different parameter names
## Repeat but use INLA for the numerics using max.mode.error = 100
## as using internal code is the default here rather than INLA
myres_inla <- fitAbn(dag = mydag,
data.df = mydat,
data.dists = mydists,
compute.fixed = TRUE,
control = list(max.mode.error = 100))
## Plot posterior densities
default_par <- par(no.readonly = TRUE) # save default par settings
par(mfrow = c(2, 2), mai = c(.7, .7, .2, .1))
plot(myres_c$marginals$b1[["b1 | (Intercept)"]], type = "l", xlab = "b1 | (Intercept)")
lines(myres_inla$marginals$b1[["b1 | (Intercept)"]], col = "blue")
plot(myres_c$marginals$b2[["b2 | p4"]], type = "l", xlab = "b2 | p4")
lines(myres_inla$marginals$b2[["b2 | p4"]], col = "blue")
plot(myres_c$marginals$g1[["g1 | precision"]], type = "l", xlab = "g1 | precision")
lines(myres_inla$marginals$g1[["g1 | precision"]], col = "blue")
plot(myres_c$marginals$b4[["b4 | b1"]], type = "l", xlab = "b4 | b1")
lines(myres_inla$marginals$b4[["b4 | b1"]], col = "blue")
par(default_par) # reset par settings
## An elementary mixed model example using built-in data specify DAG,
## only two variables using a subset of variables from ex3.dag.data
## both variables are assumed to need (separate) adjustment for the
## group variable, i.e., a binomial GLMM at each node
mydists <- list(b1 = "binomial",
b2 = "binomial")
## Compute marginal likelihood - use internal code via max.mode.error=0
## as using INLA is the default here.
## Model where b1 <- b2
myres_c <- fitAbn(dag = ~b1 | b2,
data.df = ex3.dag.data[, c(1, 2, 14)],
data.dists = mydists,
group.var = "group",
cor.vars = c("b1", "b2"),
control = list(max.mode.error = 0))
print(myres_c) # show all the output
## compare mode for node b1 with glmer(), lme4::glmer is automatically attached.
## Now for marginals - INLA is strongly preferable for estimating marginals for
## nodes with random effects as it is far faster, but may not be reliable
## see https://r-bayesian-networks.org/
## INLA's estimates of the marginals, using high n.grid = 500
## as this makes the plots smoother - see below.
myres_inla <- fitAbn(dag = ~b1 | b2,
data.df = ex3.dag.data[, c(1, 2, 14)],
data.dists = mydists,
group.var = "group",
cor.vars = c("b1", "b2"),
compute.fixed = TRUE,
n.grid = 500,
control = list(max.mode.error = 100,
max.hessian.error = 10E-02))
## this is NOT recommended - marginal density estimation using fitAbn in
## mixed models is really just for diagnostic purposes, better to use
## fitAbn.inla() here; but here goes... be patient
myres_c <- fitAbn(dag = ~b1 | b2,
data.df = ex3.dag.data[, c(1, 2, 14)],
data.dists = mydists,
group.var = "group",
cor.vars = c("b1", "b2"),
compute.fixed = TRUE,
control = list(max.mode.error = 0,
max.hessian.error = 10E-02))
## compare marginals between internal and INLA.
default_par <- par(no.readonly = TRUE) # save default par settings
par(mfrow = c(2, 3))
# 5 parameters - two intercepts, one slope, two group level precisions
plot(myres_inla$marginals$b1[[1]], type = "l", col = "blue")
lines(myres_c$marginals$b1[[1]], col = "brown", lwd = 2)
plot(myres_inla$marginals$b1[[2]], type = "l", col = "blue")
lines(myres_c$marginals$b1[[2]], col = "brown", lwd = 2)
# the precision of group-level random effects
plot(myres_inla$marginals$b1[[3]], type = "l", col = "blue", xlim = c(0, 2))
lines(myres_c$marginals$b1[[3]], col = "brown", lwd = 2)
plot(myres_inla$marginals$b2[[1]], type = "l", col = "blue")
lines(myres_c$marginals$b2[[1]], col = "brown", lwd = 2)
plot(myres_inla$marginals$b2[[1]], type = "l", col = "blue")
lines(myres_c$marginals$b2[[1]], col = "brown", lwd = 2)
# the precision of group-level random effects
plot(myres_inla$marginals$b2[[2]], type = "l", col = "blue", xlim = c(0, 2))
lines(myres_c$marginals$b2[[2]], col = "brown", lwd = 2)
par(default_par) # reset par settings
### these are very similar although not exactly identical
## use internal code but only to compute a single parameter over a specified
## grid.
## This can be necessary if the simple auto grid finding functions does
## a poor job.
myres_c <- fitAbn(dag = ~b1 | b2,
data.df = ex3.dag.data[, c(1, 2, 14)],
data.dists = mydists,
group.var = "group",
cor.vars = c("b1", "b2"),
centre = FALSE,
compute.fixed = TRUE,
control = list(marginal.node = 1,
marginal.param = 3, # precision term in node 1
variate.vec = seq(0.05, 1.5, len = 25),
max.hessian.error = 10E-02))
default_par <- par(no.readonly = TRUE) # save default par settings
par(mfrow = c(1, 2))
plot(myres_c$marginals$b1[[1]], type = "l", col = "blue") # still fairly sparse
# An easy way is to use spline to fill in the density without recomputing other
# points provided the original grid is not too sparse.
plot(spline(myres_c$marginals$b1[[1]], n = 100), type = "b", col = "brown")
par(default_par) # reset par settings
## End(Not run)
Regress each node on its parents.#'
Description
Regress each node on its parents.#'
Usage
forLoopContentFitBayes(
child = NULL,
dag = NULL,
data.df = NULL,
var.types = NULL,
grouped.vars = NULL,
group.ids = NULL,
control = NULL,
INLA.marginals = NULL,
verbose = NULL,
force.method = NULL,
data.dists = NULL,
mymodes = NULL,
error.code = NULL,
hessian.accuracy = NULL,
mymargs = NULL
)
Arguments
child |
integer of node to be regressed |
var.types |
vector of numeric encoding of distribution types. See |
INLA.marginals |
vector of logicals indicating which nodes are to be fitted using INLA |
mymodes |
Empty list of modes for each node |
error.code |
Empty element of error codes for each node |
hessian.accuracy |
Empty element of hessian accuracies for each node |
mymargs |
Empty list of marginals for each node |
Value
list of mlik, modes, marginals, error codes, hessian accuracies and a logical if INLA was used for each node.
Formula to adjacency matrix
Description
Internal function that produce a square matrix length(name) with 0,1
depending on f.
f have to start with ~ terms are entries of name terms are separated by + term1 | term2 indicates
col(term1) row(term2) puts a 1 term1 | term2:term3: ... : is used as a sep . = all terms in name
Usage
formula_abn(f, name)
Value
A square matrix
Toy Data Set for Examples in README
Description
1000 observations with 5 variables: 2 continuous, 2 binary and 1 categorical.
Usage
g2b2c_data
Format
A data frame with five columns. Binary and categorical variables are factors.
- G1
gaussian
- B1
binomial
- B2
binomial
- C
categorical
- G2
gaussian
Toy Data Set for Examples in README
Description
10000 observations with 6 variables: 2 continuous, 1 binary, 1 count, 1 categorical and 1 grouping factor.
Usage
g2pbcgrp
Format
A data frame with six columns. Binary and categorical variables are factors.
- G1
gaussian
- P
poisson
- B
binomial
- C
categorical
- G2
gaussian
- group
categorical
Bugs code for Gaussian response
Description
Bugs model for a normal distributed response variable
X \sim \mathcal{N}(\mu,\,\sigma^{2})
.
Usage
gauss_bugs(nodename, nodesintercept, parentnames, parentcoefs, std)
gauss_bugsGroup(
nodename,
nodesintercept,
parentnames,
parentcoefs,
sigma,
sigma_alpha
)
Arguments
nodename |
character string of response variable name. |
nodesintercept |
overall mean of response. Parameter from fixed-effects intercept. |
parentnames |
single character string (for one parent) or vector of characters (for multiple parent nodes) with parent node (predictor variables) names. |
parentcoefs |
overall slope for each predictor (parent node) variable (fixed-effects). |
std |
integer with standard deviation of response variable that will be converted to precision (see Details). |
sigma |
within-group variance. Parameter from random-effects residual. |
sigma_alpha |
between-group variance. Parameter from random-effects intercept. |
Details
The variance of the normal distribution is \frac{1}{\tau}
.
Value
Bugs model returned as stdout.
Functions
-
gauss_bugsGroup()
: Bugs code for Gaussian response with varying intercept
See Also
Examples
gauss_bugs(nodename = "a",
parentnames = c("b", "c"),
nodesintercept = c(0.318077),
parentcoefs = list("b"=c(b=0.3059395),
"c"=c(c=0.5555)),
std = c(0.05773503))
function to extract quantiles from INLA output
Description
function to get to extract quantiles
Usage
get.quantiles(mylist, quantiles, single)
get.ind.quantiles(outmat, inmat)
Arguments
mylist |
list of matrices of two cols x, y |
quantiles |
vector with the desired quantiles |
single |
NULL or TRUE if only a single node and parameter |
outmat |
matrix where the first col has the desired quantiles. We want to estimate this and out in into the second col |
inmat |
is the actual x,f(x) matrix |
Value
list
matrix
Functions
-
get.ind.quantiles()
: helper function for get.quantiles
Create ordered vector with integers denoting the distribution
Description
gaussian = 1, binomial = 2, poisson = 3, multinomial = 4
Usage
get.var.types(data.dists = NULL)
Arguments
data.dists |
list specifying each columns distribution type. Names correspond to column names and values must be one of "gaussian", "binomial", "poisson", "multinomial". |
Value
numeric encoding of distribution corresponding to its list element number in data.dists
.
Extract Standard Deviations from all Gaussian Nodes
Description
Extract Standard Deviations from all Gaussian Nodes
Usage
getMSEfromModes(modes, dists)
Arguments
modes |
list of modes. |
dists |
list of distributions. |
Value
named numeric vector. Names correspond to node name. Value to standard deviations.
function to extract marginals from INLA output
Description
function to extract marginals from INLA output
Usage
getMargsINLA(list.fixed, list.hyper)
Arguments
list.fixed |
list of matrices of two cols x, y |
list.hyper |
list of hyperparameters |
Value
vector
function to extract the mode from INLA output
Description
function to extract the mode from INLA output
Usage
getModeVector(list.fixed, list.hyper)
Arguments
list.fixed |
list of matrices of two cols x, y |
list.hyper |
list of hyperparameters |
Value
vector
Internal function called by fitAbn.bayes
.
Description
Function for computing marginal posterior densities using C and is called from fit.dag() Only to be called internally.
Usage
getmarginals(
res.list,
data.df,
dag.m,
var.types,
max.parents,
mean,
prec,
loggam.shape,
loggam.inv.scale,
max.iters,
epsabs,
verbose,
error.verbose,
trace,
grouped.vars,
group.ids,
epsabs.inner,
max.iters.inner,
finite.step.size,
hessian.params,
max.iters.hessian,
min.pdf,
marginal.node,
marginal.param,
variate.vec,
n.grid,
INLA.marginals,
iter.max,
max.hessian.error,
factor.brent,
maxiters.hessian.brent,
num.intervals.brent
)
Arguments
res.list |
rest of arguments as for call to C fitabn |
data.df |
a data frame containing the data used for learning the network, binary variables must be declared as factors, and no missing values all allowed in any variable. |
dag.m |
adjacency matrix |
var.types |
distributions in terms of a numeric code |
max.parents |
max number of parents over all nodes in dag (different from other |
mean |
the prior mean for all the Gaussian additive terms for each node. INLA argument |
prec |
the prior precision ( |
loggam.shape |
the shape parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
loggam.inv.scale |
the inverse scale parameter in the Gamma distribution prior for the precision in a Gaussian node. INLA argument |
max.iters |
total number of iterations allowed when estimating the modes in Laplace approximation. passed to |
epsabs |
absolute error when estimating the modes in Laplace approximation for models with no random effects. Passed to |
verbose |
if |
error.verbose |
logical, additional output in the case of errors occurring in the optimization. Passed to |
trace |
Non-negative integer. If positive, tracing information on the progress of the "L-BFGS-B" optimization is produced. Higher values may produce more tracing information. (There are six levels of tracing. To understand exactly what these do see the source code.). Passed to |
grouped.vars |
result returned from |
group.ids |
result returned from |
epsabs.inner |
absolute error in the maximization step in the (nested) Laplace approximation for each random effect term. Passed to |
max.iters.inner |
total number of iterations in the maximization step in the nested Laplace approximation. Passed to |
finite.step.size |
suggested step length used in finite difference estimation of the derivatives for the (outer) Laplace approximation when estimating modes. Passed to |
hessian.params |
a numeric vector giving parameters for the adaptive algorithm, which determines the optimal stepsize in the finite-difference estimation of the hessian. First entry is the initial guess, second entry absolute error. Passed to |
max.iters.hessian |
integer, maximum number of iterations to use when determining an optimal finite difference approximation (Nelder-Mead). Passed to |
min.pdf |
the value of the posterior density function below which we stop the estimation only used when computing marginals, see details. |
marginal.node |
used in conjunction with |
marginal.param |
used in conjunction with |
variate.vec |
a vector containing the places to evaluate the posterior marginal density, must be supplied if |
n.grid |
recompute density on an equally spaced grid with |
INLA.marginals |
vector - TRUE if INLA used false otherwise |
iter.max |
same as |
max.hessian.error |
if the estimated log marginal likelihood when using an adaptive 5pt finite-difference rule for the Hessian differs by more than |
factor.brent |
if using Brent-Dekker root bracketing method then define the outer most interval end points as the best estimate of |
maxiters.hessian.brent |
maximum number of iterations allowed in the Brent-Dekker method. Passed to |
num.intervals.brent |
the number of initial different bracket segments to try in the Brent-Dekker method. Passed to |
Value
A named list with "modes", "error.code", "hessian.accuracy", "error.code.desc", "mliknode", "mlik", "mse", "coef", "used.INLA", "marginals".
See Also
Other Bayes:
buildScoreCache()
,
calc.node.inla.glm()
,
calc.node.inla.glmm()
,
fitAbn()
Compute standard information for a DAG.
Description
This function returns standard metrics for DAG description. A list that contains the number of nodes, the number of arcs, the average Markov blanket size, the neighborhood average set size, the parent average set size and children average set size.
Usage
infoDag(object, node.names = NULL)
Arguments
object |
an object of class |
node.names |
a vector of names if the DAG is given via formula, see details. |
Details
This function returns a named list with the following entries: the number of nodes, the number of arcs, the average Markov blanket size, the neighborhood average set size, the parent average set size, and the children's average set size.
The dag
can be provided using a formula statement (similar to glm).
A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The formula statement have to start with ~
.
In this example, node1 has two parents (parent1 and parent2).
node2 and node3 have the same parent3.
The parents names have to exactly match those given in node.names
.
:
is the separator between either children or parents,
|
separates children (left side) and parents (right side),
+
separates terms, .
replaces all the variables in node.names
.
Value
A named list that contains following entries: the number of nodes, the number of arcs, the average Markov blanket size, the neighborhood average set size, the parent average set size and children average set size.
References
West, D. B. (2001). Introduction to graph theory. Vol. 2. Upper Saddle River: Prentice Hall.
Examples
## Creating a dag:
dag <- matrix(c(0,0,0,0, 1,0,0,0, 1,1,0,1, 0,1,0,0), nrow = 4, ncol = 4)
dist <- list(a="gaussian", b="gaussian", c="gaussian", d="gaussian")
colnames(dag) <- rownames(dag) <- names(dist)
infoDag(dag)
plot(createAbnDag(dag = dag, data.dists = dist))
Iterative Reweighed Least Square algorithm for Binomials
Description
IRLS to estimate network score of Binomial nodes.
Usage
irls_binomial_cpp(A, b, maxit, tol)
Value
a list
BR Iterative Reweighed Least Square algorithm for Binomials
Description
IRLS to estimate network score of Binomial nodes.
Usage
irls_binomial_cpp_br(A, b, maxit, tol)
Value
a list
Fast Iterative Reweighed Least Square algorithm for Binomials
Description
IRLS to estimate network score of Binomial nodes.
Usage
irls_binomial_cpp_fast(A, b, maxit, tol)
Value
a list
Fast Br Iterative Reweighed Least Square algorithm for Binomials
Description
IRLS to estimate network score of Binomial nodes.
Usage
irls_binomial_cpp_fast_br(A, b, maxit, tol)
Value
a list
Iterative Reweighed Least Square algorithm for Gaussians
Description
IRLS to estimate network score of Gaussian nodes.
Usage
irls_gaussian_cpp(A, b, maxit, tol)
Value
a list
Fast Iterative Reweighed Least Square algorithm for Gaussians
Description
IRLS to estimate network score of Gaussian nodes.
Usage
irls_gaussian_cpp_fast(A, b, maxit, tol)
Iterative Reweighed Least Square algorithm for Poissons
Description
IRLS to estimate network score of Poisson nodes.
Usage
irls_poisson_cpp(A, b, maxit, tol)
Value
a list
Fast Iterative Reweighed Least Square algorithm for Poissons
Description
IRLS to estimate network score of Poisson nodes.
Usage
irls_poisson_cpp_fast(A, b, maxit, tol)
Value
a list
Returns the strengths of the edge connections in a Bayesian Network learned from observational data
Description
A flexible implementation of multiple proxy for strength measures useful for visualizing the edge connections in a Bayesian Network learned from observational data.
Usage
linkStrength(dag,
data.df = NULL,
data.dists = NULL,
method = c("mi.raw",
"mi.raw.pc",
"mi.corr",
"ls",
"ls.pc",
"stat.dist"),
discretization.method = "doane")
Arguments
dag |
a matrix or a formula statement (see details for format) defining
the network structure, a directed acyclic graph (DAG).
Note that rownames must be set or given in |
data.df |
a data frame containing the data used for learning each node, binary variables must be declared as factors. |
data.dists |
a named list giving the distribution for each node in the network, see ‘Details’. |
method |
the method to be used. See ‘Details’. |
discretization.method |
a character vector giving the discretization
method to use. See |
Details
This function returns multiple proxies for estimating the connection strength
of the edges of a possibly discretized Bayesian network's data set.
The returned connection strength measures are: the Raw Mutual Information
(mi.raw
), the Percentage Mutual information (mi.raw.pc
),
the Raw Mutual Information computed via correlation (mi.corr
),
the link strength (ls
), the percentage link strength (ls.pc
)
and the statistical distance (stat.dist
).
The general concept of entropy is defined for probability distributions. The probability is estimated from data using frequency tables. Then the estimates are plug-in in the definition of the entropy to return the so-called empirical entropy. A standard known problem of empirical entropy is that the estimations are biased due to the sampling noise. This is also known that the bias will decrease as the sample size increases. The mutual information estimation is computed from the observed frequencies through a plug-in estimator based on entropy. For the case of an arc going from the node X to the node Y and the remaining set of parent of Y is denoted as Z.
The mutual information is defined as I(X, Y) = H(X) + H(Y) - H(X, Y), where H() is the entropy.
The Percentage Mutual information is defined as PI(X,Y) = I(X,Y)/H(Y|Z).
The Mutual Information computed via correlation is defined as MI(X,Y) = -0.5 log(1-cor(X,Y)^2).
The link strength is defined as LS(X->Y) = H(Y|Z)-H(Y|X,Z).
The percentage link strength is defined as PLS(X->Y) = LS(X->Y) / H(Y|Z).
The statistical distance is defined as SD(X,Y) = 1- MI(X,Y) / max(H(X),H(Y)).
Value
The function returns a named matrix with the requested metric.
References
Boerlage, B. (1992). Link strength in Bayesian networks. Diss. University of British Columbia. Ebert-Uphoff, Imme. "Tutorial on how to measure link strengths in discrete Bayesian networks." (2009).
Examples
# Gaussian
N <- 1000
mydists <- list(a="gaussian",
b="gaussian",
c="gaussian")
a <- rnorm(n = N, mean = 0, sd = 1)
b <- 1 + 2*rnorm(n = N, mean = 5, sd = 1)
c <- 2 + 1*a + 2*b + rnorm(n = N, mean = 2, sd = 1)
mydf <- data.frame("a" = a,
"b" = b,
"c" = c)
mycache.mle <- buildScoreCache(data.df = mydf,
data.dists = mydists,
method = "mle",
max.parents = 2)
mydag.mp <- mostProbable(score.cache = mycache.mle, verbose = FALSE)
linkstr <- linkStrength(dag = mydag.mp$dag,
data.df = mydf,
data.dists = mydists,
method = "ls",
discretization.method = "sturges")
Print logLik of objects of class abnFit
Description
Print logLik of objects of class abnFit
Usage
## S3 method for class 'abnFit'
logLik(object, digits = 3L, verbose = TRUE, ...)
Arguments
object |
Object of class |
digits |
number of digits of the results. |
verbose |
print additional output. |
... |
additional parameters. Not used at the moment. |
Value
prints the logLik of the fitted model.
Logit of proportions
Description
See also the C implementation ?abn::logit_cpp()
.
Usage
logit(x)
Arguments
x |
numeric with values between |
Value
numeric vector of same length as x
.
numeric vector of same length as x
.
logit functions
Description
transform x
either via the logit, or expit.
Usage
logit_cpp(x)
Arguments
x |
a numeric vector |
Value
a numeric vector
Make BUGS model from fitted DAG
Description
Make BUGS model from fitted DAG
Usage
makebugs(dag, data.dists, coefs, stderrors)
Arguments
dag |
named adjacency matrix representing the DAG. Names correspond to node names. |
data.dists |
list of node distributions. |
coefs |
a list named by the node names containing for each element a matrix with the nodes' coefficients. |
stderrors |
a list named by the node names containing for each element a matrix with the nodes' standard errors |
Value
Bugs model returned as stdout.
See Also
simulateAbn gauss_bugs bern_bugs categorical_bugs pois_bugs
Examples
## Prepare data and arguments
mydists <- list(a="gaussian",
b="multinomial",
c="binomial",
d="poisson")
mydag <- matrix(0, 4, 4, byrow = TRUE,
dimnames = list(c("a", "b", "c", "d"),
c("a", "b", "c", "d")))
mydag[2,1] <- mydag[3,2] <- mydag[4,3] <- 1
# plotAbn(mydag, data.dists = mydists)
mycoefs <- list("a"=matrix(-6.883383e-17, byrow = TRUE,
dimnames = list(NULL,
"a|intercept")),
"b"=matrix(c(2.18865, 3.133928, 3.138531, 1.686432, 3.134161, 5.052104),
nrow= 1, byrow = TRUE,
dimnames = list(c(NULL),
c("b|intercept.2", "b|intercept.3", "b|intercept.4",
"a.2", "a.3", "a.4"))),
"c"=matrix(c(1.11, 2.22, 3.33, 4.44, 5.55),
nrow= 1, byrow = TRUE,
dimnames = list(c(NULL),
c("c|intercept", "b1", "b2", "b3", "b4"))),
"d"=matrix(c(3.33, 4.44),
nrow= 1, byrow = TRUE,
dimnames = list(c(NULL),
c("d|intercept", "c"))))
mymse <- c("a"=0,"b"=1,"c"=2,"d"=3)
## Make BUGS model
makebugs(dag = mydag, data.dists = mydists, coefs = mycoefs, stderrors = mymse)
Make BUGS model from fitted DAG with grouping
Description
Make BUGS model from fitted DAG with grouping
Usage
makebugsGroup(
dag,
data.dists,
stderrors,
group.var,
mu,
betas,
sigma,
sigma_alpha
)
Arguments
dag |
named adjacency matrix representing the DAG. Names correspond to node names. |
data.dists |
list of node distributions. |
stderrors |
a list named by the node names containing for each element a matrix with the nodes' standard errors |
group.var |
only applicable for mixed models and gives the column name in |
mu |
Standard deviation of fixed effects. |
betas |
Coefficients/slopes of fixed effects . |
sigma |
variance of random effects. |
sigma_alpha |
variance-covariance matrix corresponding to covariances output from |
Value
Bugs model returned as stdout.
See Also
simulateAbn gauss_bugsGroup bern_bugsGroup categorical_bugsGroup pois_bugsGroup
Compute the Markov blanket
Description
This function computes the Markov blanket of a set of nodes given a DAG (Directed Acyclic Graph).
Usage
mb(dag, node, data.dists = NULL, data.df = NULL)
Arguments
dag |
a matrix or a formula statement (see details for format) defining the network structure, a directed acyclic graph (DAG). |
node |
a character vector of the nodes for which the Markov Blanket should be returned. |
data.dists |
a named list giving the distribution for each node in the network, see details. |
data.df |
a data frame containing the data for the nodes in the network. Only needed if |
Details
This function returns the Markov Blanket of a set of nodes given a DAG.
The dag
can be provided as a matrix where the rows and columns are the nodes names.
The matrix should be binary, where 1 indicates an edge from the column node (parent) to the row node (child).
The diagonal of the matrix should be 0 and the matrix should be acyclic.
The nodes names should be the same as the names of the distributions in data.dists
.
Alternatively, the dag
can be provided using a formula statement (similar to glm).
This requires the data.dists
and data.df
arguments to be provided.
A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The formula statement have to start with ~
.
In this example, node1
has two parents (parent1
and parent2
).
node2
and node3
are children of the same parent (parent3
).
The parents names have to exactly match those given in name
.
:
is the separator between either children or parents,
|
separates children (left side) and parents (right side),
+
separates terms, .
replaces all the variables in name
.
Value
character vector of node names from the Markov blanket.
Examples
## Defining distribution and dag
dist <- list(a="gaussian", b="gaussian", c="gaussian", d="gaussian",
e="binomial", f="binomial")
dag <- matrix(c(0,1,1,0,1,0,
0,0,1,1,0,1,
0,0,0,0,0,0,
0,0,0,0,0,0,
0,0,0,0,0,1,
0,0,0,0,0,0), nrow = 6L, ncol = 6L, byrow = TRUE)
colnames(dag) <- rownames(dag) <- names(dist)
mb(dag, node = "b", data.dists = dist)
mb(dag, node = c("b","e"), data.dists = dist)
Empirical Estimation of the Entropy from a Table of Counts
Description
This function empirically estimates the Mutual Information from a table of counts using the observed frequencies.
Usage
miData(freqs.table, method = c("mi.raw", "mi.raw.pc"))
Arguments
freqs.table |
a table of counts. |
method |
a character determining if the Mutual Information should be normalized. |
Details
The mutual information estimation is computed from the observed frequencies through a plugin estimator based on entropy.
The plugin estimator is
I(X, Y) = H (X) + H(Y) - H(X, Y)
, where
H()
is the entropy computed with entropyData
.
Value
Mutual information estimate.
integer
References
Cover, Thomas M, and Joy A Thomas. (2012). "Elements of Information Theory". John Wiley & Sons.
See Also
Examples
## Generate random variable
Y <- rnorm(n = 100, mean = 0, sd = 2)
X <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list(Y="gaussian", X="gaussian")
miData(discretization(data.df = cbind(X,Y), data.dists = dist,
discretization.method = "fd", nb.states = FALSE),
method = "mi.raw")
Mutual Information
Description
Calculates the mutual information.
Usage
mi_cpp(joint_dist)
Value
a double
Convert modes to fitAbn.mle$coefs structure
Description
Convert modes to fitAbn.mle$coefs structure
Usage
modes2coefs(modes)
Arguments
modes |
list of modes. |
Value
list of matrix arrays.
Find most probable DAG structure
Description
Find most probable DAG structure using exact order based approach due to Koivisto and Sood, 2004.
Usage
mostProbable(score.cache, score="bic", prior.choice=1, verbose=TRUE, ...)
Arguments
score.cache |
object of class |
score |
which score should be used to score the network. Possible choices are |
prior.choice |
an integer, 1 or 2, where 1 is a uniform structural prior and 2 uses a weighted prior, see details |
verbose |
if TRUE then provides some additional output. |
... |
further arguments passed to or from other methods. |
Details
The procedure runs the exact order based structure discovery approach of Koivisto and Sood (2004) to find the most probable posterior network (DAG). The local.score is the node cache, as created using buildScoreCache
(or manually provided the same format is used). Note that the scope of this search is given by the options used in local.score, for example, by restricting the number of parents or the ban or retain constraints given there.
This routine can take a long time to complete and is highly sensitive to the number of nodes in the network. It is recommended to use this on a reduced data set to get an idea as to the computational practicality of this approach. In particular, memory usage can quickly increase to beyond what may be available. For additive models, problems comprising up to 20 nodes are feasible on most machines. Memory requirements can increase considerably after this, but then so does the run time making this less practical. It is recommended that some form of over-modeling adjustment is performed on this resulting DAG (unless dealing with vast numbers of observations), for example, using parametric bootstrapping, which is straightforward to implement in MCMC engines such as JAGS or WinBUGS. See the case studies at https://r-bayesian-networks.org/
or the files provided in the package directories inst/bootstrapping_example
and inst/old_vignette
for details.
The parameter prior.choice
determines the prior used within each node for a given choice of parent combination. In Koivisto and Sood (2004) p.554, a form of prior is used, which assumes that the prior probability for parent combinations comprising of the same number of parents are all equal. Specifically, that the prior probability for parent set G with cardinality |G| is proportional to 1/[n-1 choose |G|]
where there are n total nodes. Note that this favors parent combinations with either very low or very high cardinality, which may not be appropriate. This prior is used when prior.choice=2
. When prior.choice=1
an uninformative prior is used where parent combinations of all cardinalities are equally likely. The latter is equivalent to the structural prior used in the heuristic searches e.g., searchHillclimber
or searchHeuristic
.
Note that the network score (log marginal likelihood) of the most probable DAG is not returned as it can easily be computed using fitAbn
, see examples below.
Value
An object of class abnMostprobable
, which is a list containing: a matrix giving the DAG definition of the most probable posterior structure, the cache of pre-computed scores and the score used for selection.
References
Koivisto, M. V. (2004). Exact Structure Discovery in Bayesian Networks, Journal of Machine Learning Research, vol 5, 549-573.
Examples
## Not run:
##############################
## Example 1
##############################
## This data comes with 'abn' see ?ex1.dag.data
mydat <- ex1.dag.data[1:5000, c(1:7, 10)]
## Setup distribution list for each node:
mydists <- list(b1 = "binomial",
p1 = "poisson",
g1 = "gaussian",
b2 = "binomial",
p2 = "poisson",
b3 = "binomial",
g2 = "gaussian",
g3 = "gaussian")
## Parent limits, for speed purposes quite specific here:
max_par <- list("b1" = 0,
"p1" = 0,
"g1" = 1,
"b2" = 1,
"p2" = 2,
"b3" = 3,
"g2" = 3,
"g3" = 2)
## Now build cache (no constraints in ban nor retain)
mycache <- buildScoreCache(data.df = mydat,
data.dists = mydists,
max.parents = max_par)
## Find the globally best DAG:
mp_dag <- mostProbable(score.cache = mycache)
myres <- fitAbn(object = mp_dag,
create.graph = TRUE)
plot(myres) # plot the best model
## Fit the known true DAG (up to variables 'b4' and 'b5'):
true_dag <- matrix(data = 0, ncol = 8, nrow = 8)
colnames(true_dag) <- rownames(true_dag) <- names(mydists)
true_dag["p2", c("b1", "p1")] <- 1
true_dag["b3", c("b1", "g1", "b2")] <- 1
true_dag["g2", c("p1", "g1", "b2")] <- 1
true_dag["g3", c("g1", "b2")] <- 1
fitAbn(dag = true_dag,
data.df = mydat,
data.dists = mydists)$mlik
#################################################################
## Example 2 - models with random effects
#################################################################
## This data comes with abn see ?ex3.dag.data
mydat <- ex3.dag.data[, c(1:4, 14)]
mydists <- list(b1 = "binomial",
b2 = "binomial",
b3 = "binomial",
b4 = "binomial")
## This takes a few seconds and requires INLA:
mycache_mixed <- buildScoreCache(data.df = mydat,
data.dists = mydists,
group.var = "group",
max.parents = 2)
## Find the most probable DAG:
mp_dag <- mostProbable(score.cache = mycache_mixed)
## and get goodness of fit:
fitAbn(object = mp_dag,
group.var = "group")$mlik
## End(Not run)
Print number of observations of objects of class abnFit
Description
Print number of observations of objects of class abnFit
Usage
## S3 method for class 'abnFit'
nobs(object, ...)
Arguments
object |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
prints the number of observations of the fitted model.
Probability to odds
Description
Probability to odds
Usage
odds(x)
Arguments
x |
numeric vector of probabilities with values between |
Value
numeric vector of same length as x
.
Odds Ratio from a matrix
Description
Compute the odds ratio from a contingency table or a matrix.
Usage
or(x)
Arguments
x |
a 2x2 table or matrix. |
Value
A real value.
Dataset related to diseases present in ‘finishing pigs’, animals about to enter the human food chain at an abattoir.
Description
The data we consider here comprise of a randomly chosen batch of 50 pigs from each of 500 randomly chosen pig producers in the UK. The dataset consists of 25000 observations, 10 binary variables, and a grouping variable. These are ‘finishing pigs’, animals about to enter the human food chain at an abattoir. Further description of the data set is present on the vignette.
Usage
pigs.vienna
Format
A data frame with a mixture of 10 discrete variables, each of which is set as a factor, and a grouping variable.
- PC
Binary.
- PT
Binary.
- MS
Binary.
- HS
Binary.
- TAIL
Binary.
- Abscess
Binary.
- Pyaemia
Binary.
- EPcat
Binary.
- PDcat
Binary.
- plbinary
Binary.
- batch
Group variable, corresponding to the 500 different pig producers
Details
This dataset was used in an older version of the vignette.
See also the files provided in the package directories
inst/bootstrapping_example
and inst/old_vignette
give a
detailed analysis of the dataset and provide more details for a
bootstrapping example thereof.
References
Hartnack, S., et al. (2016) "Attitudes of Austrian veterinarians towards euthanasia in small animal practice: impacts of age and gender on views on euthanasia." BMC Veterinary Research 12.1: 26.
Plots DAG from an object of class abnDag
Description
Plots DAG from an object of class abnDag
Usage
## S3 method for class 'abnDag'
plot(x, ...)
Arguments
x |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
Rgraphviz::plot
Examples
mydag <- createAbnDag(dag = ~a+b|a,
data.df = data.frame("a"=1, "b"=1),
data.dists = list(a="binomial", b="gaussian"))
plot(mydag)
Plot objects of class abnFit
Description
Plot objects of class abnFit
Usage
## S3 method for class 'abnFit'
plot(x, ...)
Arguments
x |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
a plot of the fitted model.
Plot objects of class abnHeuristic
Description
Plot objects of class abnHeuristic
Usage
## S3 method for class 'abnHeuristic'
plot(x, ...)
Arguments
x |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
plot of the scores of the heuristic search.
Plot objects of class abnHillClimber
Description
Plot objects of class abnHillClimber
Usage
## S3 method for class 'abnHillClimber'
plot(x, ...)
Arguments
x |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
plot of the consensus DAG.
Plot objects of class abnMostprobable
Description
Plot objects of class abnMostprobable
Usage
## S3 method for class 'abnMostprobable'
plot(x, ...)
Arguments
x |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
plot of the mostprobable consensus DAG.
Plot an ABN graphic
Description
Plot an ABN DAG using formula statement or a matrix in using Rgraphviz through the graphAM class.
Usage
plotAbn(dag, data.dists=NULL, markov.blanket.node=NULL, fitted.values=NULL,
digits=2, edge.strength=NULL, edge.strength.lwd=5, edge.direction="pc",
edge.color="black", edge.linetype="solid", edge.arrowsize=0.6,
edge.fontsize=node.fontsize, node.fontsize=12,
node.fillcolor=c("lightblue", "brown3", "chartreuse3"),
node.fillcolor.list=NULL,
node.shape=c("circle", "box", "ellipse", "diamond"),
plot=TRUE,
data.df=NULL, ...)
Arguments
dag |
a matrix or a formula statement (see details for format) defining
the network structure, a Directed Acyclic Graph (DAG).
Note that rownames must be set or given in |
data.dists |
a named list giving the distribution for each node in the network, see details. |
markov.blanket.node |
name of variables to display its Markov blanket. |
fitted.values |
modes or coefficents outputted from |
digits |
number of digits to display the |
edge.strength |
a named matrix containing evaluations of edge strength
which will change the arcs width (could be Mutual information, p-values,
number of bootstrap retrieve samples or the outcome of the |
edge.strength.lwd |
maximum line width for |
edge.direction |
character giving the direction in which arcs should
be plotted, |
edge.color |
the colour of the edge. |
edge.linetype |
the linetype of the edge. Defaults to |
edge.arrowsize |
the thickness of the arrows. Not relevant if |
edge.fontsize |
the font size of the arcs fitted values. |
node.fontsize |
the font size of the nodes names. |
node.fillcolor |
the colour of the node. Second and third element is used for the Markov blanket and node of the Markov blanket. |
node.fillcolor.list |
the list of node that should be coloured. |
node.shape |
the shape of the nodes according the Gaussian, binomial, Poisson and multinomial distributions. |
plot |
logical variable, if set to |
data.df |
NULL or a data frame containing the data for the nodes in the network. Only needed if |
... |
arguments passed to the plotting function. |
Details
By default binomial nodes are squares, multinomial nodes are empty, Gaussian nodes are circles and poison nodes are ellipses.
The dag
can be provided using a formula statement (similar to glm). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3
.
The construction is based on the graph package. Properties of the graph can be changed after the construction, see ‘Examples’.
Value
A rendered graph, if plot=TRUE
. The graphAM
object is returned invisibly.
See Also
Examples
# Define distribution list
dist <- list(a = "gaussian",
b = "gaussian",
c = "gaussian",
d = "gaussian",
e = "binomial",
f = "binomial")
# Define a matrix formulation
edge_strength <- matrix(c(0, 0.5, 0.5, 0.7, 0.1, 0,
0, 0, 0.3, 0.1, 0, 0.8,
0, 0, 0, 0.35, 0.66, 0,
0, 0, 0, 0, 0.9, 0,
0, 0, 0, 0, 0, 0.8,
0, 0, 0, 0, 0, 0),
nrow = 6L,
ncol = 6L,
byrow = TRUE)
## Naming of the matrix
colnames(edge_strength) <- rownames(edge_strength) <- names(dist)
## Random Data
df <- data.frame(a = rnorm(100),
b = rnorm(100),
c = rnorm(100),
d = rnorm(100),
e = rbinom(100, 1, 0.5),
f = rbinom(100, 1, 0.5))
## Plot form a matrix
plotAbn(dag = edge_strength,
data.dists = dist)
## Edge strength
plotAbn(dag = ~ a | b:c:d:e + b | c:d:f + c | d:e + d | e + e | f,
data.dists = dist,
edge.strength = edge_strength,
data.df = df)
## Plot from a formula for a different DAG!
plotAbn(dag = ~ a | b:c:e + b | c:d:f + e | f,
data.dists = dist,
data.df = df)
## Markov blanket
plotAbn(dag = ~ a | b:c:e + b | c:d:f + e | f,
data.dists = dist,
markov.blanket.node = "e",
data.df = df)
## Change col for all edges
tmp <- plotAbn(dag = ~ a | b:c:e + b | c:d:f + e | f,
data.dists = dist,
plot = FALSE,
data.df = df)
graph::edgeRenderInfo(tmp) <- list(col = "blue")
Rgraphviz::renderGraph(tmp)
## Change lty for individual ones. Named vector is necessary
tmp <- plotAbn(dag = ~ a | b:c:e + b | c:d:f + e | f,
data.dists = dist,
plot = FALSE,
data.df = df)
edgelty <- rep(c("solid", "dotted"), c(6, 1))
names(edgelty) <- names(graph::edgeRenderInfo(tmp, "col"))
graph::edgeRenderInfo(tmp) <- list(lty = edgelty)
Rgraphviz::renderGraph(tmp)
Bugs code for Poisson response
Description
Bugs model for count response variable
X \sim \mathcal{Pois}(\lambda)
.
Usage
pois_bugs(nodename, nodesintercept, parentnames, parentcoefs)
pois_bugsGroup(nodename, nodesintercept, parentnames, parentcoefs, sigma_alpha)
Arguments
nodename |
character string of response variable name. |
nodesintercept |
overall mean of response. Parameter from fixed-effects intercept. |
parentnames |
single character string (for one parent) or vector of characters (for multiple parent nodes) with parent node (predictor variables) names. |
parentcoefs |
overall slope for each predictor (parent node) variable (fixed-effects). |
sigma_alpha |
between-group variance. Parameter from random-effects intercept. |
Value
Bugs model returned as stdout.
Functions
-
pois_bugsGroup()
: Bugs code for Poisson response with varying intercept
See Also
Examples
pois_bugs(nodename = "a",
parentnames = c("b", "c"),
nodesintercept = c(0.318077),
parentcoefs = list("b"=c(b=0.3059395),
"c"=c(c=0.5555)))
Print objects of class abnCache
Description
Print objects of class abnCache
Usage
## S3 method for class 'abnCache'
print(x, digits = 3, ...)
Arguments
x |
Object of class |
digits |
number of digits of the results. |
... |
additional parameters. Not used at the moment. |
Value
summary statement of the class of abnCache
.
Examples
## Subset of the build-in dataset, see ?ex0.dag.data
mydat <- ex0.dag.data[,c("b1","b2","g1","g2","b3","g3")] ## take a subset of cols
## setup distribution list for each node
mydists <- list(b1="binomial", b2="binomial", g1="gaussian",
g2="gaussian", b3="binomial", g3="gaussian")
# Structural constraints
# ban arc from b2 to b1
# always retain arc from g2 to g1
## parent limits
max.par <- list("b1"=2, "b2"=2, "g1"=2, "g2"=2, "b3"=2, "g3"=2)
## now build the cache of pre-computed scores accordingly to the structural constraints
if(requireNamespace("INLA", quietly = TRUE)){
# Run only if INLA is available
res.c <- buildScoreCache(data.df=mydat, data.dists=mydists,
dag.banned= ~b1|b2, dag.retained= ~g1|g2, max.parents=max.par)
print(res.c)
}
Print objects of class abnDag
Description
Print objects of class abnDag
Usage
## S3 method for class 'abnDag'
print(x, digits = 3L, ...)
Arguments
x |
Object of class |
digits |
number of digits of the adjacency matrix. |
... |
additional parameters. Not used at the moment. |
Value
outputs adjacency matrix and statement of the class of x
.
Examples
mydag <- createAbnDag(dag = ~a+b|a, data.df = data.frame("a"=1, "b"=1))
print(mydag)
Print objects of class abnFit
Description
Print objects of class abnFit
Usage
## S3 method for class 'abnFit'
print(x, digits = 3L, ...)
Arguments
x |
Object of class |
digits |
number of digits of the results. |
... |
additional parameters. Not used at the moment. |
Value
prints the parameters of the fitted model.
Print objects of class abnHeuristic
Description
Print objects of class abnHeuristic
Usage
## S3 method for class 'abnHeuristic'
print(x, digits = 2L, ...)
Arguments
x |
Object of class |
digits |
number of digits of the results. |
... |
additional parameters. Not used at the moment. |
Value
prints the best score found and the distribution of the scores.
Print objects of class abnHillClimber
Description
Print objects of class abnHillClimber
Usage
## S3 method for class 'abnHillClimber'
print(x, digits = 3L, ...)
Arguments
x |
Object of class |
digits |
number of digits of the results. |
... |
additional parameters. Not used at the moment. |
Value
prints the consensus DAG and the class of the object.
Print objects of class abnMostprobable
Description
Print objects of class abnMostprobable
Usage
## S3 method for class 'abnMostprobable'
print(x, digits = 3L, ...)
Arguments
x |
Object of class |
digits |
number of digits of the results. |
... |
additional parameters. Not used at the moment. |
Value
prints the mostprobable consensus DAG.
Rank of a matrix
Description
similar to base::rank
Usage
rank_cpp(A)
Value
an integer
Compute the score's contribution in a network of each observation.
Description
This function computes the score's contribution of each observation to the total network score.
Usage
scoreContribution(object = NULL,
dag = NULL, data.df = NULL, data.dists = NULL,
verbose = FALSE)
Arguments
object |
an object of class ' |
dag |
a matrix or a formula statement (see details) defining the network structure, a directed acyclic graph (DAG), see details for format. Note that colnames and rownames must be set. |
data.df |
a data frame containing the data used for learning the network, binary variables must be declared as factors and no missing values all allowed in any variable. |
data.dists |
a named list giving the distribution for each node in the network, see details. |
verbose |
if |
Details
This function computes the score contribution of each observation
to the total network score.
This function is available only in the mle
settings.
To do so one uses the glm
and predict
functions.
This function is an attempt to perform diagnostic for an ABN analysis.
Value
A named list that contains the scores contributions: maximum likelihood, aic, bic, mdl and diagonal values of the hat matrix.
Examples
## Not run:
## Use a subset of a built-in simulated data set
mydat <- ex1.dag.data[,c("b1","g1","p1")]
## setup distribution list for each node
mydists <- list(b1="binomial", g1="gaussian", p1="poisson")
## now build cache
mycache <- buildScoreCache(data.df = mydat, data.dists = mydists, max.parents = 2, method = "mle")
## Find the globally best DAG
mp.dag <- mostProbable(score.cache=mycache, score="bic", verbose = FALSE)
out <- scoreContribution(object = mp.dag)
## Observations contribution per network node
boxplot(out$bic)
## End(Not run)
A family of heuristic algorithms that aims at finding high scoring directed acyclic graphs
Description
A flexible implementation of multiple greedy search algorithms to find high scoring network (DAG)
Usage
searchHeuristic(score.cache, score = "mlik",
num.searches = 1, seed = 42L, start.dag = NULL,
max.steps = 100,
algo = "hc", tabu.memory = 10, temperature = 0.9,
verbose = FALSE, ...)
Arguments
score.cache |
output from |
score |
which score should be used to score the network. Possible choices are |
num.searches |
a positive integer giving the number of different search to run, see details. |
seed |
a non-negative integer which sets the seed. |
start.dag |
a DAG given as a matrix, see details for format, which can be used to explicity provide a starting point for the structural search. |
max.steps |
a constant giving the number of search steps per search, see details. |
algo |
which heuristic algorithm should be used. Possible choices are: |
tabu.memory |
a non-negative integer number to set the memory of the |
temperature |
a real number giving the update in temperature for the |
verbose |
if TRUE then provides some additional output. |
... |
further arguments passed to or from other methods. |
Details
This function is a flexible implementation of multiple greedy heuristic algorithms,
particularly well adapted to the abn
framework.
It targets multi-random restarts heuristic algorithms.
The user can select the num.searches
and the maximum number of steps
within by max.steps
. The optimization algorithm within each search is
relatively rudimentary.
The function searchHeuristic
is different from the
searchHillClimber
in the sense that this function is fully
written in R, whereas the searchHillClimber
is written in C
and thus expected to be faster. The function searchHillClimber
at each hill-climbing step consider every information from the pre-computed
scores cache while the function searchHeuristic
performs a local
stepwise optimization. This function chooses a structural move (or edge move)
and compute the score's change. On this point, it is closer to the MCMCMC
algorithm from Madigan and York (1995) and Giudici and Castelo (2003)
with a single edge move.
If the user select random
, then a random valid DAG is selected.
The routine used favourise low density structure.
The function implements three algorithm selected with the
parameter algo
: hc
, tabu
or sa
.
If algo=hc
:
The Hill-climber algorithm (hc
) is a single move algorithm.
At each Hill-climbing step within a search an arc is attempted to be added.
The new score is computed and compared to the previous network's score.
If algo=tabu
:
The same algorithm is as with hc
is used, but a list of banned moves
is computed. The parameter tabu.memory
controls the length of the tabu
list. The idea is that the classical Hill-climber algorithm is inefficient
when it should cross low probability regions to unblock from a local maximum
and reaching a higher score peak. By forcing the algorithm to choose some not
already used moves, this will force the algorithm to escape the local maximum.
If algo=sa
:
This variant of the heuristic search algorithm is based on simulated annealing
described by Metropolis et al. (1953).
Some accepted moves could result in a decrease of the network score.
The acceptance rate can be monitored with the parameter temperature
.
Value
An object of class abnHeuristic
(which extends the class abnLearnd
) and contains list with entires:
- dags
a list of DAGs
- scores
a vector giving the network score for the locally optimal network for each search
- detailed.score
a vector giving the evolution of the network score for the all the random restarts
- score
a number giving the network score for the locally optimal network
- score.cache
the pre-computed cache of scores
- num.searches
a numeric giving the number of random restart
- max.steps
a numeric giving the maximal number of optimization steps within each search
- algorithm
a character for indicating the algorithm used
References
Heckerman, D., Geiger, D. and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243. Madigan, D. and York, J. (1995) "Bayesian graphical models for discrete data". International Statistical Review, 63:215232. Giudici, P. and Castelo, R. (2003). "Improving Markov chain Monte Carlo model search for data mining". Machine Learning, 50:127158. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). "Equation of state calculations by fast computing machines". The journal of chemical physics, 21(6), 1087-1092.
Examples
## Not run:
##############################################
## example: use built-in simulated data set
##############################################
mydat <- ex1.dag.data ## this data comes with abn see ?ex1.dag.data
## setup distribution list for each node
mydists<-list(b1="binomial", p1="poisson", g1="gaussian", b2="binomial",
p2="poisson", b3="binomial", g2="gaussian", b4="binomial",
b5="binomial", g3="gaussian")
mycache <- buildScoreCache(data.df = mydat, data.dists = mydists, max.parents = 2)
## Now peform 10 greedy searches
heur.res <- searchHeuristic(score.cache = mycache, data.dists = mydists,
start.dag = "random", num.searches = 10,
max.steps = 50)
## Plot (one) dag
plotAbn(heur.res$dags[[1]], data.dists = mydists)
## End(Not run)
Find high scoring directed acyclic graphs using heuristic search.
Description
Find high scoring network (DAG) structures using a random re-starts greedy hill-climber heuristic search.
Usage
searchHillClimber(score.cache, score = "mlik", num.searches = 1, seed = 42,
start.dag = NULL, support.threshold = 0.5, timing.on = TRUE,
dag.retained = NULL, verbose = FALSE, ...)
Arguments
score.cache |
output from |
score |
character giving which network score should be used to
select the structure. Currently |
num.searches |
number of times to run the search. |
seed |
non-negative integer which sets the seed in the GSL random number generator. |
start.dag |
a DAG given as a matrix, see details for format, which can be used to provide a starting point for the structural search explicitly. |
support.threshold |
the proportion of search results - each locally optimal DAG - in which each arc must appear to be a part of the consensus network. |
timing.on |
extra output in terms of duration computation. |
dag.retained |
a DAG given as a matrix, see details for format. This is necessary if the score.cache was created using an explicit retain matrix, and the same retain matrix should be used here. dag.retained is used by the algorithm which generates the initial random DAG to ensure that the necessary arcs are retained. |
verbose |
extra output. |
... |
further arguments passed to or from other methods. |
Details
The procedure runs a greedy hill-climbing search similar, but not identical,
to the method presented initially in Heckerman et al. 1995.
(Machine Learning, 20, 197-243).
Each search begins with a randomly chosen DAG structure where this is
constructed in such a way as to attempt to choose a DAG uniformly from
the vast landscape of possible structures. The algorithm used is as follows:
given a node cache (from buildScoreCache
,
then within this set of all allowed local parent combinations,
a random combination is chosen for each node.
This is then combined into a full DAG, which is then checked for cycles,
where this check iterates over the nodes in a random order.
If all nodes have at least one child (i.e., at least one cycle is present),
then the first node examined has all its children removed, and the check for
cycles is then repeated. If this has removed the only cycle present,
then this DAG is used at the starting point for the search,
if not, a second node is chosen (randomly) and the process is then
repeated until a DAG is obtained.
The actual hill-climbing algorithm itself differs slightly from that presented in Heckerman et al. as a full cache of all possible local combinations are available. At each hill-climbing step, everything in the node cache is considered. In other words, all possible single swaps between members of cache currently present in the DAG and those in the full cache. The single swap, which provides the greatest increase in goodness of fit is chosen. A single swap here is the removal or addition of any one node-parent combination present in the cache while avoiding a cycle. This means that as well as all single arc changes (addition or removal), multiple arc changes are also considered at each same step, note however that arc reversal in this scheme takes two steps (as this requires first removal of a parent arc from one node and then addition of a parent arc to a different node). The original algorithm perturbed the current DAG by only a single arc at each step but also included arc reversal. The current implementation may not be any more efficient than the original but is arguably more natural given a pre-computed cache of local scores.
A start DAG may be provided in which case num.searches must equal 1 - this option is really just to provide a local search around a previously identified optimal DAG.
This function is designed for two different purposes:
i) interactive visualization; and
ii) longer batch processing. The former provides easy visual "eyeballing" of
data in terms of its majority consensus network (or similar threshold),
which is a graphical structure which comprises of all arcs which feature in
a given proportion (support.threshold
) of locally optimal DAGs already
identified during the run. The general hope is that this structure will
stabilize - become fixed - relatively quickly, at least for problems with
smaller numbers of nodes.
Value
A list with entries:
- init.score
a vector giving network score for initial network from which the search commenced
- final.score
a vector giving the network score for the locally optimal network
- init.dag
list of matrices, initial DAGs
- final.dag
list of matrices, locally optimal DAGs
- consensus
a matrix holding a binary graph, not necessary a DAG!
- support.threshold
percentage supported used to create consensus matrix
References
Lewis, F. I., and McCormick, B. J. J. (2012). Revealing the complexity of health determinants in resource poor settings. American Journal Of Epidemiology. DOI:10.1093/aje/KWS183).
Simulate data from a fitted additive Bayesian network.
Description
Simulate data from a fitted additive Bayesian network.
Usage
simulateAbn(
object = NULL,
run.simulation = TRUE,
bugsfile = NULL,
n.chains = 10L,
n.adapt = 1000L,
n.thin = 100L,
n.iter = 10000L,
seed = 42L,
verbose = FALSE,
debug = FALSE
)
Arguments
object |
of type |
run.simulation |
call JAGS to simulate data (default is |
bugsfile |
A path to a valid file or |
n.chains |
number of parallel chains for the model. |
n.adapt |
number of iteration for adaptation. If |
n.thin |
thinning interval for monitors. |
n.iter |
number of iteration to monitor. |
seed |
by default set to 42. |
verbose |
if TRUE prints additional output |
debug |
if TRUE prints bug file content to stdout and does not run simulations. |
Value
data.frame
See Also
Examples
df <- FCV[, c(12:15)]
mydists <- list(Outdoor="binomial",
Sex="multinomial",
GroupSize="poisson",
Age="gaussian")
## buildScoreCache -> mostProbable() -> fitAbn()
suppressWarnings({
mycache.mle <- buildScoreCache(data.df = df, data.dists = mydists, method = "mle",
adj.vars = NULL, cor.vars = NULL,
dag.banned = NULL, dag.retained = NULL,
max.parents = 1,
which.nodes = NULL, defn.res = NULL)
}) # ignore non-convergence warnings
mp.dag.mle <- mostProbable(score.cache = mycache.mle, verbose = FALSE)
myres.mle <- fitAbn(object = mp.dag.mle, method = "mle")
myres.sim <- simulateAbn(object = myres.mle,
run.simulation = TRUE,
bugsfile = NULL,
verbose = FALSE)
str(myres.sim)
prop.table(table(myres.sim$Outdoor))
prop.table(table(df$Outdoor))
Simulate a DAG with with arbitrary arcs density
Description
Provided with node names, returns an abnDAG
.
Arc density refers to the chance of a node being connected to the node before it.
Usage
simulateDag(node.name, data.dists = NULL, edge.density = 0.5, verbose = FALSE)
Arguments
node.name |
a vector of character giving the names of the nodes. It gives the size of the simulated DAG. |
data.dists |
named list corresponding to the |
edge.density |
number in |
verbose |
print more information on the run. |
Details
This function generates DAGs by sampling triangular matrices and reorder columns and rows randomly.
The network density (edge.density
) is used column-wise as binomial sampling probability.
Then the matrix is named using the user-provided names.
Value
object of class abnDag
consisting of a named matrix, a named list giving the distribution for each node and an empty element for the data.
Examples
simdag <- simulateDag(node.name = c("a", "b", "c", "d"),
edge.density = 0.5,
data.dists = list(a = "gaussian",
b = "binomial",
c = "poisson",
d = "multinomial"))
## Example using Ozon entries:
dist <- list(Ozone="gaussian", Solar.R="gaussian", Wind="gaussian",
Temp="gaussian", Month="gaussian", Day="gaussian")
out <- simulateDag(node.name = names(dist), data.dists = dist, edge.density = 0.8)
plot(out)
Computes skewness of a distribution
Description
Computes skewness of a distribution
Usage
skewness(x)
Arguments
x |
a numeric vector |
Value
integer
Standard Area Under the Marginal
Description
function to get std. are under marginal to exactly unity. It should be very close to unity but in some cases due to numerical accuracy differences (since each point is a separate estimate) this might be a little adrift turn this option off to see how reliable the original estimation is
Usage
std.area.under.grid(mylist, single)
Arguments
mylist |
list of matrices of two cols x, y |
single |
NULL or TRUE if only a single node and parameter |
Value
list
Recursive string splitting
Description
Internal function that call multiple times strsplit() and remove space
Usage
strsplits(x, splits, ...)
Value
A vector of strings
Prints summary statistics from an object of class abnDag
Description
Prints summary statistics from an object of class abnDag
Usage
## S3 method for class 'abnDag'
summary(object, ...)
Arguments
object |
an object of class |
... |
additional parameters. Not used at the moment. |
Value
List with summary statistics of the DAG.
Examples
mydag <- createAbnDag(dag = ~a+b|a, data.df = data.frame("a"=1, "b"=1))
summary(mydag)
Print summary of objects of class abnFit
Description
Print summary of objects of class abnFit
Usage
## S3 method for class 'abnFit'
summary(object, digits = 3L, ...)
Arguments
object |
Object of class |
digits |
number of digits of the results. |
... |
additional parameters. Not used at the moment. |
Value
prints summary statistics of the fitted model.
Print summary of objects of class abnMostprobable
Description
Print summary of objects of class abnMostprobable
Usage
## S3 method for class 'abnMostprobable'
summary(object, ...)
Arguments
object |
Object of class |
... |
additional parameters. Not used at the moment. |
Value
prints the mostprobable consensus DAG and the number of observations used to calculate it.
tidy up cache
Description
tidy up cache
Usage
tidy.cache(thecache)
Value
list of chache with error indexes removed
Convert a DAG into graphviz format
Description
Given a matrix defining a DAG create a text file suitable for plotting with graphviz.
Usage
toGraphviz(dag,
data.df=NULL,
data.dists=NULL,
group.var=NULL,
outfile=NULL,
directed=TRUE,
verbose=FALSE)
Arguments
dag |
a matrix defining a DAG. |
data.df |
a data frame containing the data used for learning the network. |
data.dists |
a list with named arguments matching the names of the data frame which gives the distribution family for each variable. See |
group.var |
only applicable for mixed models and gives the column name in |
outfile |
a character string giving the filename which will contain the graphviz graph. |
directed |
logical; if TRUE, a directed acyclic graph is produced, otherwise an undirected graph. |
verbose |
if TRUE more output is printed. If TRUE and 'outfile=NULL' the '.dot' file is printed to console. |
Details
Graphviz (https://www.graphviz.org) is a visualisation software developed by AT&T and freely available.
This function creates a text representation of the DAG, or the undirected graph, so this can be plotted using graphviz.
The R package, Rgraphviz
(available through the Bioconductor project https://www.bioconductor.org/) interfaces R and the working installation of graphviz.
Binary nodes will appear as squares, Gaussian as ovals and Poisson nodes as diamonds in the resulting graphviz network diagram. There are many other shapes possible for nodes and numerous other visual enhancements - see online graphviz documentation.
Bespoke refinements can be added by editing the raw outfile produced. For full manual editing, particularly of the layout, or adding annotations, one easy solution is to convert a postscript format graph (produced in graphviz using the -Tps switch) into a vector format using a tool such as pstoedit (http://www.pstoedit.net/), and then edit using a vector drawing tool like xfig. This can then be resaved as postscript or pdf thus retaining full vector quality.
Value
Nothing is returned, but a file outfile
written.
Author(s)
Fraser Iain Lewis
Marta Pittavino
Examples
## On a typical linux system the following code constructs a nice
## looking pdf file 'graph.pdf'.
## Not run:
## Subset of a build-in dataset
mydat <- ex0.dag.data[,c("b1","b2","b3","g1","b4","p2","p4")]
## setup distribution list for each node
mydists <- list(b1="binomial", b2="binomial", b3="binomial",
g1="gaussian", b4="binomial", p2="poisson",
p4="poisson")
## specify DAG model
mydag <- matrix(c( 0,1,0,0,1,0,0, #
0,0,0,0,0,0,0, #
0,1,0,0,1,0,0, #
1,0,0,0,0,0,1, #
0,0,0,0,0,0,0, #
0,0,0,1,0,0,0, #
0,0,0,0,1,0,0 #
), byrow=TRUE, ncol=7)
colnames(mydag) <- rownames(mydag) <- names(mydat)
## create file for processing with graphviz
outfile <- paste(tempdir(), "graph.dot", sep="/")
toGraphviz(dag=mydag, data.df=mydat, data.dists=mydists, outfile=outfile)
## and then process using graphviz tools e.g. on linux
if(Sys.info()[["sysname"]] == "Linux" && interactive()) {
system(paste( "dot -Tpdf -o graph.pdf", outfile))
system("evince graph.pdf")
}
## Example using data with a group variable where b1<-b2
mydag <- matrix(c(0,1, 0,0), byrow=TRUE, ncol=2)
colnames(mydag) <- rownames(mydag) <- names(ex3.dag.data[,c(1,2)])
## specific distributions
mydists <- list(b1="binomial", b2="binomial")
## create file for processing with graphviz
outfile <- paste0(tempdir(), "/graph.dot")
toGraphviz(dag=mydag, data.df=ex3.dag.data[,c(1,2,14)], data.dists=mydists,
group.var="group",
outfile=outfile, directed=FALSE)
## and then process using graphviz tools e.g. on linux:
if(Sys.info()[["sysname"]] == "Linux" && interactive()) {
pdffile <- paste0(tempdir(), "/graph.pdf")
system(paste("dot -Tpdf -o ", pdffile, outfile))
system(paste("evince ", pdffile, " &")) ## or some other viewer
}
## End(Not run)
Check for valid distribution
Description
The distribution names must match inla() family=''
.
Similar to get.var.types()
, mainly different in output.
Usage
validate_dists(data.dists, returnDists = TRUE, ...)
Arguments
data.dists |
list of variable distributions. |
returnDists |
if TRUE (default) returns the same list as provided. |
... |
additional arguments. |
Value
either TRUE/FALSE or list of variable distributions as provided as input.
simulated dataset from a DAG comprising of 33 variables
Description
250 observations simulated from a DAG with 17 binary variables and 16 continuous. A DAG of this data features in the vignette. Note that the conditional dependence relations given are those in the population and may differ in the realization of 250 observations.
Usage
var33
Format
A data frame with a mixture of discrete variables each of which is set as a factor and continuous variables. Joint distribution structure used to generate the data.
- v1
Binary, independent.
- v2
Gaussian, conditionally dependent upon v1.
- v3
Binary, independent.
- v4
Binary, conditionally dependent upon v3.
- v5
Gaussian, conditionally dependent upon v6.
- v6
Binary, conditionally dependent upon v4 and v7.
- v7
Gaussian, conditionally dependent upon v8.
- v8
Gaussian, conditionally dependent upon v9.
- v9
Binary, conditionally dependent upon v10.
- v10
Binary, independent.
- v11
Binary, conditionally dependent upon v10, v12 and v19.
- v12
Binary, independent.
- v13
Gaussian, independent.
- v14
Gaussian, conditionally dependent upon v13.
- v15
Binary, conditionally dependent upon v14 and v21.
- v16
Gaussian, independent.
- v17
Gaussian, conditionally dependent upon v16 and v20.
- v18
Binary, conditionally dependent upon v20.
- v19
Binary, conditionally dependent upon v20.
- v20
Binary, independent.
- v21
Binary, conditionally dependent upon v20.
- v22
Gaussian, conditionally dependent upon v21.
- v23
Gaussian, conditionally dependent upon v21.
- v24
Gaussian, conditionally dependent upon v23.
- v25
Gaussian, conditionally dependent upon v23 and v26.
- v26
Binary, conditionally dependent upon v20.
- v27
Binary, independent.
- v28
Binary, conditionally dependent upon v27, v29 and v31.
- v29
Gaussian, independent.
- v30
Gaussian, conditionally dependent upon v29.
- v31
Gaussian, independent.
- v32
Binary, conditionally dependent upon v21, v29 and v31.
- v33
Gaussian, conditionally dependent upon v31.
Examples
## Constructing the DAG of the dataset:
dag33 <- matrix(0, 33, 33)
dag33[2,1] <- 1
dag33[4,3] <- 1
dag33[6,4] <- 1; dag33[6,7] <- 1
dag33[5,6] <- 1
dag33[7,8] <- 1
dag33[8,9] <- 1
dag33[9,10] <- 1
dag33[11,10] <- 1; dag33[11,12] <- 1; dag33[11,19] <- 1;
dag33[14,13] <- 1
dag33[17,16] <- 1; dag33[17,20] <- 1
dag33[15,14] <- 1; dag33[15,21] <- 1
dag33[18,20] <- 1
dag33[19,20] <- 1
dag33[21,20] <- 1
dag33[22,21] <- 1
dag33[23,21] <- 1
dag33[24,23] <- 1
dag33[25,23] <- 1; dag33[25,26] <- 1
dag33[26,20] <- 1
dag33[33,31] <- 1
dag33[33,31] <- 1
dag33[32,21] <- 1; dag33[32,31] <- 1; dag33[32,29] <- 1
dag33[30,29] <- 1
dag33[28,27] <- 1; dag33[28,29] <- 1; dag33[28,31] <- 1