Authors: Enrico Glaab and Sophie Le Bars
This vignette demonstrates a standard workflow for analyzing bulk RNA-seq data to identify sex-specific effects using the XYomics package. The package provides a streamlined set of tools for differential expression, pathway analysis, and network-based interpretation.
This tutorial covers the following key steps:
We begin by simulating a bulk RNA-seq dataset to illustrate the package’s capabilities. The simulate_omics_data function creates a dataset with 500 genes and 100 samples, balanced across phenotype (Control/Disease) and sex (male/female).
The four categorized groups are: - Male-specific: Genes up-regulated in the disease state only in males. - Female-specific: Genes up-regulated in the disease state only in females. - Sex-dimorphic: Genes regulated in opposite directions between males and females in the disease state. - Sex-neutral: Genes regulated with same directions between males and females in the disease state.
library(XYomics)
library(dplyr)
# Load precomputed bulk dataset
expression_data <- readRDS(
system.file("extdata", "bulk_expression.rds", package = "XYomics")
)
sex <- readRDS(
system.file("extdata", "bulk_sex.rds", package = "XYomics")
)
phenotype <- readRDS(
system.file("extdata", "bulk_phenotype.rds", package = "XYomics")
)
When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.
Sex-stratified analyses use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. These may include classical parametric hypothesis tests, such as Welch’s test for normally distributed data, or non-parametric tests, such as the Mann–Whitney U test, as well as special moderated statistics for high-dimensional omics data analysis, such as the empirical Bayes moderated t-statistic. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.
Interaction analysis formally tests whether the relationship between disease and molecular changes differs significantly between males and females. Not only can such interaction terms reveal complexities in disease mechanisms that might otherwise be obscured in analyses that do not consider SABV, but they also have the potential to detect changes that are limited to the magnitude of an effect, an aspect that sex-stratified analyses do not capture. Nevertheless, robust estimation of interaction effects requires large sample sizes, which are often not available due to the costs associated with advanced molecular profiling techniques such as single-cell RNA sequencing.
The XYomics package provides functions for both types of analyses.
Here, we use the sex_stratified_analysis_bulk() function to perform a stratified analysis.
res <- sex_stratified_analysis_bulk(expression_data, sex, phenotype)
# The `min_samples` parameter defines the minimum number of cells required
# to perform the interaction analysis. The default value is 3.
res <- readRDS(
system.file("extdata", "bulk_results_degs.rds", package = "XYomics")
)
# Internal QC results from XYomics can be accessed through:
# res$validation
#
# This includes validation of sex and phenotype balance, group sample sizes,
# detection of design imbalances, and identification of groups below the
# minimum sample threshold required for reliable downstream analysis.
res$validation
## $valid
## [1] TRUE
##
## $sex_counts
## sex_vec
## F M
## 10 10
##
## $sex_ratio
## sex_vec
## F M
## 0.5 0.5
##
## $imbalance
## M
## FALSE
##
## $ratio_difference
## M
## 0
##
## $group_counts
## group
## WT.F TG.F WT.M TG.M
## 5 5 5 5
##
## $low_sample_groups
## named integer(0)
The categorize_sex() function categorizes genes based on the results of the stratified analysis. This provides a useful, albeit potentially less robust, classification.
res_cat <- categorize_sex(res$male_DEGs, res$female_DEGs)
res_cat <- readRDS(
system.file("extdata", "bulk_results_cat.rds", package = "XYomics")
)
cat("Top sex-stratified differentially expressed genes:\n")
## Top sex-stratified differentially expressed genes:
head(res_cat)
## DEG_Type Gene_Symbols Male_avg_logFC Male_FDR
## male-specific1 male-specific HNF4GP1 -2.3385776 7.877895e-20
## male-specific2 male-specific CAVIN4 1.9039227 9.522511e-16
## male-specific3 male-specific EPS8L2 -1.3881366 3.773010e-10
## male-specific4 male-specific NFASC 1.0298699 5.079206e-07
## male-specific5 male-specific AFAP1-AS2 0.9356591 4.506606e-06
## male-specific6 male-specific WNT11 0.8978827 1.308807e-05
## Female_avg_logFC Female_FDR
## male-specific1 0.12374127 0.5881039
## male-specific2 -0.02654086 0.9265521
## male-specific3 0.07858904 0.7320063
## male-specific4 0.09590189 0.6691739
## male-specific5 -0.11079054 0.6175147
## male-specific6 0.04508701 0.8562109
table(res_cat$DEG_Type)
##
## female-specific male-specific sex-dimorphic sex-neutral
## 3 11 3 51
We use the sex_interaction_analysis_bulk() function to perform a formal interaction analysis.
interaction_term_results_bulk <- sex_interaction_analysis_bulk(expression_data, phenotype, sex)
# The `min_samples` parameter defines the minimum number of cells required
# to perform the interaction analysis. The default value ofr interaction analysis is 20.
interaction_term_results_bulk <- readRDS(
system.file("extdata", "bulk_interaction_results.rds", package = "XYomics")
)
head(interaction_term_results_bulk$summary_stats)# signficant sex-modulated DEGs: interaction_term_results_bulk$sig_results
## n_total_genes n_sig_genes voom_used
## 1 100 57 FALSE
We use generate_violinplot_bulk() to visualize the expression of top genes identified by the stratified analysis.
top_male_gene <- res_cat %>%
filter(DEG_Type == "male-specific") %>%
arrange(Male_FDR) %>%
pull(Gene_Symbols) %>%
head(1)
top_female_gene <- res_cat %>%
filter(DEG_Type == "female-specific") %>%
arrange(Female_FDR) %>%
pull(Gene_Symbols) %>%
head(1)
top_dimorphic_gene <- res_cat %>%
filter(DEG_Type == "sex-dimorphic") %>%
arrange(Female_FDR) %>%
pull(Gene_Symbols) %>%
head(1)
top_neutral_gene <- res_cat %>%
filter(DEG_Type == "sex-neutral") %>%
arrange(Female_FDR) %>%
pull(Gene_Symbols) %>%
head(1)
top_genes = c(top_male_gene, top_female_gene, top_dimorphic_gene, top_neutral_gene )
generate_violinplot_bulk(expression_data,sex,phenotype , top_genes)
We perform pathway enrichment on the gene lists derived from the stratified analysis using categorized_enrich().
pathway_results <- categorized_enrich(res_cat, enrichment_db = "GO")
# The `enrichment_db` parameter can be set to "GO", "KEGG", "REACTOME".
# If the user wants to use a custom database, a TERM2GENE data frame must be
# provided via the `custom_db` parameter.
#
# Gene identifiers can be either any Gene identifier type supported by OrgDb
#' (e.g., "SYMBOL", "ENTREZID", "ENSEMBL", "UNIPROT")
# using the `gene_type` parameter.
#
# The `return_df` parameter controls the output format:
# if set to TRUE, results are returned as data frames;
# by default (FALSE), enrichResult objects are returned
pathway_results <- readRDS(
system.file("extdata", "bulk_pathway_results.rds", package = "XYomics")
)
head(pathway_results)
## $`male-specific`
## #
## # over-representation test
## #
## #...@organism Homo sapiens
## #...@ontology BP
## #...@keytype SYMBOL
## #...@gene chr [1:11] "HNF4GP1" "CAVIN4" "EPS8L2" "NFASC" "AFAP1-AS2" "WNT11" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...99 enriched terms found
## 'data.frame': 99 obs. of 12 variables:
## $ ID : chr "GO:0090269" "GO:0090270" "GO:0003139" "GO:0048340" ...
## $ Description : chr "fibroblast growth factor production" "regulation of fibroblast growth factor production" "secondary heart field specification" "paraxial mesoderm morphogenesis" ...
## $ GeneRatio : chr "1/4" "1/4" "1/4" "1/4" ...
## $ BgRatio : chr "10/18860" "10/18860" "11/18860" "11/18860" ...
## $ RichFactor : num 0.1 0.1 0.0909 0.0909 0.0714 ...
## $ FoldEnrichment: num 472 472 429 429 337 ...
## $ zScore : num 21.7 21.7 20.7 20.7 18.3 ...
## $ pvalue : num 0.00212 0.00212 0.00233 0.00233 0.00297 ...
## $ p.adjust : num 0.0411 0.0411 0.0411 0.0411 0.0411 ...
## $ qvalue : num 0.00871 0.00871 0.00871 0.00871 0.00871 ...
## $ geneID : chr "WNT11" "WNT11" "WNT11" "WNT11" ...
## $ Count : int 1 1 1 1 1 1 1 1 1 1 ...
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
##
##
## $`female-specific`
## #
## # over-representation test
## #
## #...@organism Homo sapiens
## #...@ontology BP
## #...@keytype SYMBOL
## #...@gene chr [1:3] "HCG4" "MTRNR2L9" "TBC1D3B"
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...3 enriched terms found
## 'data.frame': 3 obs. of 12 variables:
## $ ID : chr "GO:1900118" "GO:1900117" "GO:0097194"
## $ Description : chr "negative regulation of execution phase of apoptosis" "regulation of execution phase of apoptosis" "execution phase of apoptosis"
## $ GeneRatio : chr "1/1" "1/1" "1/1"
## $ BgRatio : chr "21/18860" "40/18860" "96/18860"
## $ RichFactor : num 0.0476 0.025 0.0104
## $ FoldEnrichment: num 898 472 196
## $ zScore : num 30 21.7 14
## $ pvalue : num 0.00111 0.00212 0.00509
## $ p.adjust : num 0.00318 0.00318 0.00509
## $ qvalue : logi NA NA NA
## $ geneID : chr "MTRNR2L9" "MTRNR2L9" "MTRNR2L9"
## $ Count : int 1 1 1
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
##
##
## $`sex-neutral`
## #
## # over-representation test
## #
## #...@organism Homo sapiens
## #...@ontology BP
## #...@keytype SYMBOL
## #...@gene chr [1:51] "KLHL35" "TRIM17" "RPS26P44" "PP12613" "SETP12" "METTL16" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...0 enriched terms found
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
# visualize via dotplot
plot <- plot_enrichment_dotplots(pathway_results)
print(plot)
$all
Finally, we construct a protein-protein interaction network using the results from the stratified analysis.
First, we download a protein-protein interaction network from the STRING database using get_string_network() or directly from the package.
# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)
# Load a pre-existing network from a file
g <- readRDS(system.file("extdata", "string_example_network.rds", package = "XYomics")) # You can also download the hormonal network "hormonal_network_metacore.rds" instead of "string_example_network.rds".
Next, we define “prizes” for our genes based on their statistical significance (e.g., -log10 of the p-value) and use the construct_ppi_pcsf function to build a context-specific network.
# Use dimorphic DE results to define prizes
dimorphic_specific <- res_cat[res_cat$DEG_Type == "sex-dimorphic", ]
dimorphic_prizes <- -log10((dimorphic_specific$Male_FDR + dimorphic_specific$Female_FDR) / 2)
names(dimorphic_prizes) <-dimorphic_specific$Gene_Symbols
# Construct the PCSF subnetwork
dimorphic_network <- construct_ppi_pcsf(g = g, prizes = dimorphic_prizes)
network <- ppi_pipeline(res_cat, g)
network <- readRDS(
system.file("extdata", "bulk_network_results.rds", package = "XYomics")
)
neutral_network <- readRDS(
system.file("extdata", "bulk_neutral_network.rds", package = "XYomics")
)
neutral_specific <- res_cat[res_cat$DEG_Type == "sex-neutral", ]
Finally, we use the plot_network() function to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.
plot_network(neutral_network, "sex-neutral", neutral_specific, show_barplot = T)
#Generate plots for all categories
all_plots <- plot_network_pipeline(network, res_cat)
## plot_network: Graph contains no nodes.
The generate_cat_report function can be used to compile all results into a single HTML report.
# This command generates a comprehensive HTML report
template_path = system.file("extdata", "Template_report_bulk.Rmd", package = "XYomics") # fetch the template for bulk from the package
generate_cat_report(res_cat, pathway_results, network, template_path = template_path)
This vignette has demonstrated two key approaches for analyzing sex-specific effects in bulk RNA-seq data. While sex-stratified analysis is a common first step, interaction analysis provides a more statistically robust method for identifying genes whose regulation by disease is truly dependent on sex. We recommend using an interaction-based approach when possible, while being mindful of the need for adequate sample sizes.