Type: | Package |
Title: | Statistical Methods for Microbiome Compositional Data |
Version: | 1.2 |
Date: | 2024-03-13 |
Author: | Xianyang Zhang [aut], Jun Chen [aut, cre], Huijuan Zhou [ctb] |
Maintainer: | Jun Chen <chen.jun2@mayo.edu> |
Description: | A suite of methods for powerful and robust microbiome data analysis addressing zero-inflation, phylogenetic structure and compositional effects (Zhou et al. (2022)<doi:10.1186/s13059-022-02655-5>). The methods can be applied to the analysis of other (high-dimensional) compositional data arising from sequencing experiments. |
Depends: | R (≥ 3.5.0) |
Imports: | ggplot2, matrixStats, parallel, stats, utils, Matrix, statmod, MASS, ggrepel, lmerTest, foreach, modeest |
NeedsCompilation: | yes |
License: | GPL-3 |
Encoding: | UTF-8 |
Packaged: | 2024-04-01 22:00:40 UTC; m123485 |
Repository: | CRAN |
Date/Publication: | 2024-04-01 22:30:02 UTC |
Linear (Lin) Model for Differential Abundance (DA) Analysis of High-dimensional Compositional Data
Description
The function implements a simple, robust and highly scalable approach to tackle the compositional effects in differential abundance analysis of high-dimensional compositional data. It fits linear regression models on the centered log2-ratio transformed data, identifies a bias term due to the transformation and compositional effect, and corrects the bias using the mode of the regression coefficients. It could fit mixed-effect models for analysis of correlated data.
Usage
linda(
feature.dat,
meta.dat,
formula,
feature.dat.type = c('count', 'proportion'),
prev.filter = 0,
mean.abund.filter = 0,
max.abund.filter = 0,
is.winsor = TRUE,
outlier.pct = 0.03,
adaptive = TRUE,
zero.handling = c('pseudo-count', 'imputation'),
pseudo.cnt = 0.5,
corr.cut = 0.1,
p.adj.method = "BH",
alpha = 0.05,
n.cores = 1,
verbose = TRUE
)
Arguments
feature.dat |
a matrix of counts/proportions, row - features (OTUs, genes, etc) , column - samples. |
meta.dat |
a data frame containing the sample meta data. If there are NAs, the corresponding samples will be removed in the analysis. |
formula |
a character string for the formula. The formula should conform to that used by |
feature.dat.type |
the type of the feature data. It could be "count" or "proportion". |
prev.filter |
the prevalence (percentage of non-zeros) cutoff, under which the features will be filtered. The default is 0. |
mean.abund.filter |
the mean relative abundance cutoff, under which the features will be filtered. The default is 0. |
max.abund.filter |
the max relative abundance cutoff, under which the features will be filtered. The default is 0. |
is.winsor |
a logical value indicating whether winsorization should be performed to replace outliers (high values). The default is TRUE. |
outlier.pct |
the expected percentage of outliers. These outliers will be winsorized. The default is 0.03. |
adaptive |
a logical value indicating whether the approach to handle zeros (pseudo-count or imputation)
will be determined based on the correlations between the log(sequencing depth) and the explanatory variables
in |
zero.handling |
a character string of 'pseudo-count' or 'imputation' indicating the zero handling method
used when |
pseudo.cnt |
a positive numeric value for the pseudo-count to be added if |
corr.cut |
a numerical value between 0 and 1, indicating the significance level used for determining
the zero-handling approach when |
p.adj.method |
a character string indicating the p-value adjustment approach for
addressing multiple testing. See R function |
alpha |
a numerical value between 0 and 1 indicating the significance level for declaring differential features. Default is 0.05. |
n.cores |
a positive integer. If |
verbose |
a logical value indicating whether the trace information should be printed out. |
Value
A list with the elements
variables |
A vector of variable names of all fixed effects in |
bias |
numeric vector; each element corresponds to one variable in |
output |
a list of data frames with columns 'baseMean', 'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj', 'reject',
'df';
|
covariance |
a list of data frames; the data frame records the covariances between a regression coefficient with other coefficients;
|
otu.tab.use |
the OTU table used in the abundance analysis (the |
meta.use |
the meta data used in the abundance analysis (only variables in |
wald |
a list for use in Wald test. If the fitting model is a linear model, then it includes
If the fitting model is a linear mixed-effect model, then it includes
|
Author(s)
Huijuan Zhou, Jun Chen, Xianyang Zhang
References
Zhou, H., He, K., Chen, J., Zhang, X. (2022). LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome biology, 23(1), 95.
Examples
data(smokers)
ind <- smokers$meta$AIRWAYSITE == 'Throat'
otu.tab <- as.data.frame(smokers$otu[, ind])
depth <- colSums(otu.tab)
meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]),
Sex = factor(smokers$meta$SEX[ind]),
Site = factor(smokers$meta$SIDEOFBODY[ind]),
SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind]))
# Differential abundance analysis using the left throat data
ind1 <- meta$Site == 'Left' & depth >= 1000
linda.obj <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex',
feature.dat.type = 'count',
prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
p.adj.method = "BH", alpha = 0.1)
linda.plot(linda.obj, c('Smokey', 'Sexmale'),
titles = c('Smoke: n v.s. y', 'Sex: female v.s. male'),
alpha = 0.1, lfc.cut = 1, legend = TRUE, directory = NULL,
width = 11, height = 8)
rownames(linda.obj $output[[1]])[which(linda.obj $output[[1]]$reject)]
# Differential abundance analysis pooling both the left and right throat data
# Mixed effects model is used
ind <- depth >= 1000
linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex+(1|SubjectID)',
feature.dat.type = 'count',
prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
p.adj.method = "BH", alpha = 0.1)
# For proportion data
otu.tab.p <- t(t(otu.tab) / colSums(otu.tab))
ind1 <- meta$Site == 'Left' & depth >= 1000
lind.obj <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex',
feature.dat.type = 'proportion',
prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
p.adj.method = "BH", alpha = 0.1)
Plot LinDA Results
Description
The function produces the effect size plot of the differential features and volcano plot based on the output from linda
.
Usage
linda.plot(
linda.obj,
variables.plot,
titles = NULL,
alpha = 0.05,
lfc.cut = 1,
legend = FALSE,
directory = NULL,
width = 11,
height = 8
)
Arguments
linda.obj |
return from function |
variables.plot |
vector; variables whose results are to be plotted. For example, suppose the return
value |
titles |
vector; titles of the effect size plot and volcano plot for each variable in |
alpha |
a numerical value between 0 and 1; cutoff for |
lfc.cut |
a positive numerical value; cutoff for |
legend |
TRUE or FALSE; whether to show the legends of the effect size plot and volcano plot. |
directory |
character; the directory to save the figures, e.g., |
width |
the width of the graphics region in inches. See R function |
height |
the height of the graphics region in inches. See R function |
Value
A list of ggplot2
objects.
plot.lfc |
a list of effect size plots. Each plot corresponds to one variable in |
plot.volcano |
a list of volcano plots. Each plot corresponds to one variable in |
Author(s)
Huijuan Zhou, Jun Chen, Xianyang Zhang
References
Zhou, H., He, K., Chen, J., Zhang, X. (2022). LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome biology, 23(1), 95.
Examples
data(smokers)
ind <- smokers$meta$AIRWAYSITE == 'Throat' & smokers$meta$SIDEOFBODY == 'Left'
otu.tab <- as.data.frame(smokers$otu[, ind])
depth <- colSums(otu.tab)
meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]),
Sex = factor(smokers$meta$SEX[ind]))
ind <- depth >= 1000
linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex',
feature.dat.type = 'count',
prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
p.adj.method = "BH", alpha = 0.1)
linda.plot(linda.obj, c('Smokey', 'Sexmale'),
titles = c('Smoke: n v.s. y', 'Sex: female v.s. male'),
alpha = 0.1, lfc.cut = 1, legend = TRUE, directory = NULL,
width = 11, height = 8)
Wald test for bias-corrected regression coefficients
Description
The function implements Wald test for bias-corrected regression coefficients generated from the linda
function.
One can utilize the function to perform ANOVA-type analyses. For example, if you have a cateogrical variable with three levels, you can test whether all levels have the same effect.
Usage
linda.wald.test(
linda.obj,
L,
model = c("LM", "LMM"),
alpha = 0.05,
p.adj.method = "BH"
)
Arguments
linda.obj |
return from the |
L |
A matrix for testing |
model |
|
alpha |
significance level for testing |
p.adj.method |
P-value adjustment approach. See R function |
Value
A data frame with columns
Fstat |
Wald statistics for each taxon |
df1 |
The numerator degrees of freedom |
df2 |
The denominator degrees of freedom |
pvalue |
|
padj |
|
reject |
|
Author(s)
Huijuan Zhou huijuanzhou2019@gmail.com Jun Chen Chen.Jun2@mayo.edu Xianyang Zhang zhangxiany@stat.tamu.edu
References
Zhou, H., He, K., Chen, J., Zhang, X. (2022). LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome biology, 23(1), 95.
Examples
data(smokers)
ind <- smokers$meta$AIRWAYSITE == 'Throat'
otu.tab <- as.data.frame(smokers$otu[, ind])
depth <- colSums(otu.tab)
meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]),
Sex = factor(smokers$meta$SEX[ind]),
Site = factor(smokers$meta$SIDEOFBODY[ind]),
SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind]))
ind <- depth >= 1000
linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex+(1|SubjectID)',
feature.dat.type = 'count',
prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
p.adj.method = "BH", alpha = 0.1)
# L matrix (2x3) is designed to test the second (Smoke) and the third (Sex) coefficient to be 0.
# For a categorical variable > two levels, similar L can be designed to do ANOVA-type test.
L <- matrix(c(0, 1, 0, 0, 0, 1), nrow = 2, byrow = TRUE)
result <- linda.wald.test(linda.obj, L, 'LMM', alpha = 0.1)
Microbiome data from the human upper respiratory tract (left and right throat)
Description
A dataset containing "otu", "tax", meta", "genus", family"
Usage
data(smokers)
Format
A list with elements
- otu
otu table, 2156 taxa by 290 samples
- tax
taxonomy table, 2156 taxa by 7 taxonomic ranks
- meta
meta table, 290 samples by 53 sample variables
- genus
304 by 290
- family
113 by 290
Source
https://qiita.ucsd.edu/ study ID:524 Reference: Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, et al. (2010) Disordered Microbial Communities in the Upper Respiratory Tract of Cigarette Smokers. PLoS ONE 5(12): e15216.