Title: | Boltzmann Bayes Learner |
Version: | 1.0.0 |
Maintainer: | Jun Woo <junwoo035@gmail.com> |
Depends: | R (≥ 3.6.0) |
Imports: | methods, stats, utils, Rcpp (≥ 0.12.16), pROC, RColorBrewer |
Description: | Supervised learning using Boltzmann Bayes model inference, which extends naive Bayes model to include interactions. Enables classification of data into multiple response groups based on a large number of discrete predictors that can take factor values of heterogeneous levels. Either pseudo-likelihood or mean field inference can be used with L2 regularization, cross-validation, and prediction on new data. <doi:10.18637/jss.v101.i05>. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
LinkingTo: | Rcpp |
Encoding: | UTF-8 |
RoxygenNote: | 7.1.2 |
NeedsCompilation: | yes |
Suggests: | glmnet, BiocManager, Biostrings |
Author: | Jun Woo |
Repository: | CRAN |
Packaged: | 2022-01-27 16:11:27 UTC; junwoo |
Date/Publication: | 2022-01-27 23:20:12 UTC |
Boltzmann Bayes Learning Inference
Description
Main driver for bbl inference
Usage
bbl(
formula,
data,
weights,
xlevels = NULL,
verbose = 1,
method = "pseudo",
novarOk = FALSE,
testNull = TRUE,
prior.count = 1,
...
)
Arguments
formula |
Formula for modeling |
data |
Data for fitting |
weights |
Vector of weights for each instance in data. Restricted to
non-negative integer frequencies, recoding the number of times
each row of data must be repeated. If |
xlevels |
List of factor levels for predictors. If |
verbose |
Output verbosity level. Will be send to down-stream function calls with one level lower |
method |
BB inference algorithm; pseudo-likelihood inference ( |
novarOk |
If |
testNull |
Repeat the inference for the ‘pooled’ sample; i.e., under the null hypothesis of all rows in data belonging to a single group |
prior.count |
Prior count for computing single predictor and pairwise frequencies |
... |
Other parameters to |
Details
Formula argument and data are used to tabulate xlevels unless explicitly
given as list. Data are expected to be factors or integers. This function
is a driver interepreting formula and calls bbi.fit
. Will stop with
error if any predictor has only one level unless novarOk='TRUE'
.
Use removeConst
to remove the non-varying predictors before
calling if this happens.
Value
A list of class bbl
with the following elements:
coefficients |
List of inferred coefficients with elements
|
xlevels |
List of vectors containing predictor levels. |
terms |
The |
groups |
Vector of response groups. |
groupname |
Name of the response variable. |
qJ |
Matrix of logicals whose elements record whether
|
model |
Model data frame derived from |
lkh |
Log likelihood. |
lz |
Vector log partition function. Used in |
weights |
Vector of integral weights (frequencies). |
call |
Function call. |
df |
Degrees of freedom. |
Author(s)
Jun Woo, junwoo035@gmail.com
References
Examples
titanic <- as.data.frame(Titanic)
b <- bbl(Survived ~ (Class + Sex + Age)^2, data = titanic, weights = Freq)
b
bbl Inference with model matrix
Description
Performs bbl inference using response vector and predictor matrix
Usage
bbl.fit(
x,
y,
qJ = NULL,
weights = NULL,
xlevels = NULL,
verbose = 1,
method = "pseudo",
prior.count = 1,
...
)
Arguments
x |
Data frame of factors with each predictor in columns. |
y |
Vector of response variables. |
qJ |
Matrix of logicals indicating which predictor combinations are interacting. |
weights |
Vector of non-negative integer frequencies, recoding
the number of times each row of data must be repeated.
If |
xlevels |
List of factor levels for predictors. If |
verbose |
Verbosity level of output. Will be propagated to
|
method |
|
prior.count |
Prior count for computing single predictor and pairwise frequencies |
... |
Other arguments to |
Details
This function would normally be called by bbl
rather than
directly. Expects the predictor data x
and response vector y
instead of formula input to bbl
.
Value
List of named components h
, J
, lkh
, and
lz
; see bbl
for information regarding these
components.
Examples
titanic <- as.data.frame(Titanic)
freq <- titanic$Freq
x <- titanic[,1:3]
y <- titanic$Survived
b <- bbl.fit(x=x,y=y, weights=freq)
b
Cross-Validation of BB Learning
Description
Run multiple fittings of bbl
model with training/validation
division of data
Usage
crossVal(
formula,
data,
weights,
novarOk = FALSE,
lambda = 1e-05,
lambdah = 0,
eps = 0.9,
nfold = 5,
method = "pseudo",
use.auc = TRUE,
verbose = 1,
progress.bar = FALSE,
storeOpt = TRUE,
...
)
Arguments
formula |
Formula for model. Note that intercept has no effect. |
data |
Data frame of data. Column names must match |
weights |
Frequency vector of how many times each row of |
novarOk |
Proceed even when there are predictors with only one factor level. |
lambda |
Vector of L2 penalizer values for |
lambdah |
L2 penalizer in |
eps |
Vector of regularization parameters, |
nfold |
Number of folds for training/validation split. |
method |
|
use.auc |
Use AUC as the measure of prediction accuracy. Only works
if response groups are binary. If |
verbose |
Verbosity level. Downgraded when relayed into |
progress.bar |
Display progress bar in |
storeOpt |
Store the optimal fitted object of class |
... |
Other parameters to |
Details
The data
slot of object
is split into training and validation
subsets of (nfold
-1):1 ratio. The model is trained with the
former and validated on the latter. Individual division/fold results are
combined into validation result for all instances in the data set and
prediction score is evaluated using the known response group
identity.
Value
Object of class cv.bbl
extending bbl
, a list
with extra components:
regstar
, Value of regularization parameter, lambda
and eps
for method='pseudo'
and method='mf'
,respectively,
at which the accuracy score is maximized;
maxscore
, Value of maximum accuracy;
cvframe
, Data frame of regularization parameters and scores scanned.
If use.auc=TRUE
, also contains 95
Examples
set.seed(513)
m <- 5
n <- 100
predictors <- list()
for(i in 1:m) predictors[[i]] <- c('a','c','g','t')
names(predictors) <- paste0('v',1:m)
par <- list(randompar(predictors), randompar(predictors, h0=0.1, J0=0.1))
dat <- randomsamp(predictors, response=c('ctrl','case'), par=par, nsample=n)
cv <- crossVal(y ~ .^2, data=dat, method='mf', eps=seq(0.1,0.9,0.1))
cv
Fitted Response Group Probabilities
Description
Response group probabilities from BBL fit
Usage
## S3 method for class 'bbl'
fitted(object, ...)
Arguments
object |
Object of class |
... |
Other arguments |
Details
This method returns predicted response group probabilities of trainig data
Value
Matrix of response group probabities with data points in rows and response groups in columns
Examples
titanic <- as.data.frame(Titanic)
fit <- bbl(Survived ~ Class + Sex + Age, data=titanic, weights=titanic$Freq)
Formula in BBL Fitting
Description
Returns the formula used in BBL fit
Usage
## S3 method for class 'bbl'
formula(x, ...)
Arguments
x |
Object of class |
... |
Other arguments |
Value
Formula object
Examples
titanic <- as.data.frame(Titanic)
fit <- bbl(Survived ~ Class + Sex + Age, data=titanic, weights=titanic$Freq)
formula(fit)
Convert Frequency Table into Raw Data
Description
Data with unique rows and a frequency column is converted into data with duplicate rows.
Usage
freq2raw(data, freq)
Arguments
data |
Data frame with factors in columns |
freq |
Vector of frequency of each row in |
Details
The ouput data frame can be used as input to bbl
.
Value
Data frame with one row per instances
Examples
Titanic
x <- as.data.frame(Titanic)
head(x)
titanic <- freq2raw(data=x[,1:3], freq=x$Freq)
head(titanic)
Log likelihood for bbl object
Description
Compute log likelihood from a fitted bbl
object
Usage
## S3 method for class 'bbl'
logLik(object, ...)
Arguments
object |
Object of class |
... |
Other arguments to methods |
Details
This method uses inferred parameters from calls to bbl
and data to compute the log likelihood.
Value
An object of class logLik
, the Log likelihood value
and the attribute "df" (degrees of freedom), the number of
parameters.
Sample Predictor Distributions
Description
Uses fitted BBL model to explore predictor distributions
Usage
mcSample(object, nsteps = 1000, verbose = 1, progress.bar = TRUE)
Arguments
object |
Object of class |
nsteps |
Total number of MC steps |
verbose |
Verbosity level of output |
progress.bar |
Display progress bar |
Details
After bbl
fit, the resulting model is used by this function to sample
predictor distributions in each response group and find the most likely
preditor set using MCMC.
Examples
titanic <- as.data.frame(Titanic)
b <- bbl(Survived~., data=titanic[,1:4], weights=titanic$Freq)
pxy <- mcSample(b)
pxy
Maximum likelihood estimate
Description
Perform inference of bias and interaction parameters for a single response group
Usage
mlestimate(
xi,
weights = NULL,
qJ = NULL,
method = "pseudo",
L = NULL,
lambda = 1e-05,
lambdah = 0,
symmetrize = TRUE,
eps = 0.9,
nprint = 100,
itmax = 1e+05,
tolerance = 1e-04,
verbose = 1,
prior.count = 1,
naive = FALSE,
lz.half = FALSE
)
Arguments
xi |
Data matrix; expected to be numeric with elements ranging from
zero to positive integral upper bound |
weights |
Frequency vector of number of times each row of |
qJ |
Matrix of logicals indicating which predictor pairs are
interacting. If |
method |
|
L |
Vector of number of factor levels in each predictor. If
|
lambda |
Vector of L2 regularization parameters for
|
lambdah |
L2 parameters for |
symmetrize |
Enforce the symmetry of interaction parameters by
taking mean values of the matrix and its trace:
|
eps |
Vector of regularization parameters for |
nprint |
Frequency of printing iteration progress under |
itmax |
Maximum number of iterations for |
tolerance |
Upper bound for fractional changes in pseduo-likelihood
values before termiating iteration in |
verbose |
Verbosity level. |
prior.count |
Prior count for |
naive |
Naive Bayes inference. Equivalent to |
lz.half |
Divide interaction term in approximation to |
Details
Given numeric data matrix, either pseudo-likelihood
of mean-field theory is used to find the maximum likelihood estimate
of bias h
and interaction J
parameters. Normally
called by bbl
rather than directly.
Value
List of inferred parameters h
and J
. See
bbl
for parameter structures.
Examples
set.seed(535)
predictors <- list()
for(i in 1:5) predictors[[i]] <- c('a','c','g','t')
par <- randompar(predictors)
par
xi <- sample_xi(nsample=5000, predictors=predictors, h=par$h, J=par$J,
code_out=TRUE)
head(xi)
ps <- mlestimate(xi=xi, method='pseudo', lambda=0)
ps$h
ps$J[[1]]
mf <- mlestimate(xi=xi, method='mf', eps=0.9)
plot(x=unlist(par$h), y=unlist(ps$h), xlab='True', ylab='Inferred')
segments(x0=-2, x1=2, y0=-2, y1=2, lty=2)
points(x=unlist(par$J), y=unlist(ps$J), col='red')
points(x=unlist(par$h), y=unlist(mf$h), col='blue')
points(x=unlist(par$J), y=unlist(mf$J), col='green')
Model Frame for BBL
Description
Returns the model frame used in BBL fit
Usage
## S3 method for class 'bbl'
model.frame(formula, ...)
Arguments
formula |
Object of class |
... |
Other arguments |
Value
Data frame used for fitting
Examples
titanic <- as.data.frame(Titanic)
fit <- bbl(Survived ~ Class + Sex + Age, data=titanic[,1:4], weights=titanic$Freq)
head(model.frame(fit))
Number of Observations in BBL Fit
Description
Returns the number of observations from a BBL fit
Usage
## S3 method for class 'bbl'
nobs(object, ...)
Arguments
object |
Object of class |
... |
Other arguments |
Value
An integer of number of observations
Examples
titanic <- as.data.frame(Titanic)
fit <- bbl(Survived ~ Class + Sex + Age, data=titanic[,1:4], weights=titanic$Freq)
nobs(fit)
Plot bbl object
Description
Visualize bias and interaction parameters
Usage
## S3 method for class 'bbl'
plot(x, layout = NULL, hcol = NULL, Jcol = NULL, npal = 100, ...)
Arguments
x |
Object of class |
layout |
Matrix of layouts for arrangment of linear and interaction
parameters. If |
hcol |
Color for linear barplots. Grayscale if |
Jcol |
Color for interaction heatmaps. Default ( |
npal |
Number of color scales. |
... |
Other graphical parameters for |
Details
This method displays a barplot of bias parameters and heatmaps (one per response group) of interaction parameters. All parameters are offset by the pooled values (single group inference) unless missing.
Plot Cross-validation Outcome
Description
Plot cross-validation score as a function of regularization parameter
Usage
## S3 method for class 'cv.bbl'
plot(
x,
type = "b",
log = "x",
pch = 21,
bg = "white",
xlab = NULL,
ylab = NULL,
las = 1,
...
)
Arguments
x |
Object of class |
type |
Symbol type in |
log |
Log scale argument to |
pch |
Symbol type code in |
bg |
Symbol background color in |
xlab |
X axis label |
ylab |
Y axis label |
las |
Orientation of axis labels in |
... |
Other arguments to |
Details
This function will plot accuracy score as a function of regularization parameter
from a call to crossVal
.
Predict Response Group Using bbl
Model
Description
Make prediction of response group identity based on trained model
Usage
## S3 method for class 'bbl'
predict(object, newdata, type = "link", verbose = 1, progress.bar = FALSE, ...)
Arguments
object |
Object of class |
newdata |
Data frame of new data for which prediction is to
be made. Columns must contain all of those in |
type |
Return value type. If |
verbose |
Verbosity level |
progress.bar |
Display progress of response group probability. Useful for large samples. |
... |
Other arguments to methods |
Details
This method uses a new data set for predictors and trained bbl
model
parameters to compute posterior probabilities of response group
identity.
Value
Data frame of predicted posterior probabilities with samples in rows and response groups in columns. The last column is the predicted response group with maximum probability.
Examples
set.seed(154)
m <- 5
L <- 3
n <- 1000
predictors <- list()
for(i in 1:m) predictors[[i]] <- seq(0,L-1)
names(predictors) <- paste0('v',1:m)
par <- list(randompar(predictors=predictors, dJ=0.5),
randompar(predictors=predictors, h0=0.1, J0=0.1, dJ=0.5))
dat <- randomsamp(predictors=predictors, response=c('ctrl','case'), par=par,
nsample=n)
dat <- dat[sample(n),]
dtrain <- dat[seq(n/2),]
dtest <- dat[seq(n/2+1,n),]
model <- bbl(y ~ .^2, data=dtrain)
pred <- predict(model, newdata=dtest)
score <- mean(dtest$y==pred$yhat)
score
auc <- pROC::roc(response=dtest$y, predictor=pred$case, direction='<')$auc
auc
Predict using Cross-validation Object
Description
Use the optimal fitted model from cross-validation run to make prediction
Usage
## S3 method for class 'cv.bbl'
predict(object, ...)
Arguments
object |
Object of class |
... |
Other parameters to |
Details
This method will use the fitted model with maximum accuracy score returned
by a call to crossVal
to make prediction on new data
Value
Data frame of prediction; see predict.bbl
.
Print Boltzmann Bayes Learning Fits
Description
This method displays model structure and first elements of coefficients
Usage
## S3 method for class 'bbl'
print(x, showcoeff = TRUE, maxcoeff = 3L, ...)
Arguments
x |
An object of class |
showcoeff |
Display first few fit coefficients |
maxcoeff |
Maximum number of coefficients to display |
... |
Further arguments passed to or from other methods |
Details
Displays the call to bbl
, response variable and its levels,
predictors and their levels, and the first few fit coefficients.
Display Cross-validation Result
Description
Print cross-validation optimal result and data frame
Usage
## S3 method for class 'cv.bbl'
print(x, ...)
Arguments
x |
Object of class |
... |
Other arguments to methods |
Details
This method prints crossVal
object with the optimal
regularization condition and maximum accuracy score on top and
the entire score profile as a data frame below.
Print Summary of Boltzmann Bayes Learning
Description
This method prints the summary of bbl
object
Usage
## S3 method for class 'summary.bbl'
print(x, ...)
Arguments
x |
Object of class |
... |
Other arguments to methods |
Details
The naive Bayes summary of summary.bbl
object is displayed.
Generate Random Parameters
Description
Random values of bias and interaction parameters are generated using either uniform or normal distributions.
Usage
randompar(predictors, distr = "unif", h0 = 0, dh = 1, J0 = 0, dJ = 1)
Arguments
predictors |
List of predictor factor levels. See |
distr |
|
h0 |
Mean of bias parameters |
dh |
|
J0 |
Mean of interaction parameters. |
dJ |
|
Details
Input argument predictors
is used to set up proper list
structures of parameters.
Value
List of parameters, h
and J
.
Examples
set.seed(311)
predictors <- list()
for(i in 1:5) predictors[[i]] <- c('a','c')
par <- randompar(predictors=predictors)
par
Generate Random Boltzmann Bayes Model Data
Description
Predictor-response paired data are generated
Usage
randomsamp(predictors, response, prob = NULL, par, nsample = 100)
Arguments
predictors |
List of vectors of predictor levels |
response |
Vector of response variables |
prob |
Vector of probabilities for sampling each response group |
par |
List of |
nsample |
Sample size |
Details
The argument response
is used to set up all possible levels
of response groups and likewise for predictors
. The parameter
argument par
must have the appropriate structure
consistent with response
and predictors
. This function
is a wrapper calling sample_xi
multiple times.
Value
Data frame of response and predictor variables.
Read FASTA File
Description
Read nucleotide sequence files in FASTA format
Usage
readFasta(file, rownames = FALSE)
Arguments
file |
File name of FASTA input. |
rownames |
Use the sequence annotation line in file (starts with
|
Details
Sequence data in FASTA files are converted into data frame
suitable as input to bbl
. If sequence lengths are different,
instances longer than those already read will be truncated. Empty sequences
are skipped.
Value
Data frame of each sequence in rows.
Examples
file <- tempfile('data')
write('>seq1', file)
write('atgcc', file, append=TRUE)
write('>seq2', file, append=TRUE)
write('gccaa', file, append=TRUE)
system(paste0('cat ',file))
x <- readFasta(file)
x
Remove Non-varying Predictors
Description
Constant predictor is identified and removed
Usage
removeConst(x)
Arguments
x |
Data frame containing discrete factor variables in each column |
Details
Variables with only one factor level is removed from data. Intended
for use before calling bbl
.
Value
Data frame omitting non-varying variables from x
.
Examples
set.seed(351)
nt <- c('a','c','g','t')
x <- data.frame(v1=sample(nt,size=50,replace=TRUE),
v2=rep('a',50),v3=sample(nt,size=50,replace=TRUE))
y <- sample(c('case','ctrl'),size=50,replace=TRUE)
dat <- cbind(data.frame(y=y), x)
summary(dat)
dat <- removeConst(dat)
summary(dat)
Residuals of BBL fit
Description
Binary-valued vector of fitted vs. true response group
Usage
## S3 method for class 'bbl'
residuals(object, ...)
Arguments
object |
Object of class |
... |
Other arguments |
Details
Discrete response group identity for each data point is compared with the fitted group and 0 (discordant) or 1 (concordant) is returned
Value
Vector binary values for each data point
Examples
titanic <- as.data.frame(Titanic)
dat <- freq2raw(titanic[,1:4], freq=titanic$Freq)
fit <- bbl(Survived ~ .^2, data=dat)
x <- residuals(fit)
table(x)
Generate Random Samples from Boltzmann Distribution
Description
Random samples are drawn from Boltzmann distribution
Usage
sample_xi(nsample = 1, predictors = NULL, h, J, code_out = FALSE)
Arguments
nsample |
Sample size |
predictors |
List of predictor factor levels. |
h |
Bias parameter; see |
J |
Interaction parameters; see |
code_out |
Ouput in integer codes; |
Details
All possible factor states are enumerated exhaustively using
input argument predictors
. If the number of predictors m
or the number of factor levels L_i
for each predictor i
are even moderately large (m\ge 10
or L_i\ge 5
),
this function will likely hang because the number of all possible
states grows exponentially.
Value
Data frame of samples in rows and predictors in columns.
Examples
set.seed(512)
m <- 5
n <- 1000
predictors <- list()
for(i in 1:m) predictors[[i]] <- c('a','c','g','t')
par <- randompar(predictors)
xi <- sample_xi(nsample=n, predictors=predictors, h=par$h, J=par$J)
head(xi)
Naive Bayes Summary
Description
Estimate significant of predictor-group association using naive Bayes model
Usage
## S3 method for class 'bbl'
summary(object, prior.count = 0, ...)
Arguments
object |
Object of class |
prior.count |
Prior count to be used for computing naive Bayes
coefficients and test results. If |
... |
Other arguments to methods. |
Details
This summary.bbl
method gives a rough overview of associations
within a bbl
fit object via naive Bayes coefficients and test
p-values. Note that naive Bayes results displayed ignore interactions
even when interactions are present in the model being displayed. This
feature is because simple analytic results exist for naive Bayes
coefficients and test p-values. The likelihood ratio test is with respect
to the null hypothesis that coefficients are identical for all response
groups.
Value
Object of class summary.bbl
extending bbl
class;
a list with extra components
h |
List of bias coefficients of response groups under naive Bayes approximation |
h0 |
Bias coefficients of pooled group under naive Bayes |
chisqNaive |
Vector of chi-square statistics for likelihood ratio test for each predictor |
dfNaive |
Vector of degrees of freedom for likelihood ratio test for each predictor |
pvNaive |
Vector p-values for each predictor |
Weights in BBL Fit
Description
This method returns weights used in BBL fit.
Usage
## S3 method for class 'bbl'
weights(object, ...)
Arguments
object |
Object of class |
... |
Other arguments |
Details
Note that weithts are integral
frequency values specifying repeat number of each instance in bbl
.
If no weights were used (default of 1s), NULL
is returned.
Value
Vector of weights for each instance