0.1 Bulk RNA-Seq Analysis with XYomics

Authors: Enrico Glaab and Sophie Le Bars

1 Introduction

This vignette demonstrates a standard workflow for analyzing bulk RNA-seq data to identify sex-specific effects using the XYomics package. The package provides a streamlined set of tools for differential expression, pathway analysis, and network-based interpretation.

This tutorial covers the following key steps:

  1. Simulating a bulk RNA-seq dataset with built-in sex-specific and shared gene expression changes.
  2. Discussing and applying two different strategies for differential expression analysis: sex-stratified analysis, and interaction analysis.
  3. Identifying and categorizing sex-specific genes into distinct groups (male-specific, female-specific, etc.) based on stratified analysis.
  4. Visualizing expression patterns of top genes using the package’s built-in plotting functions.
  5. Conducting pathway enrichment analysis to determine the biological functions of the identified gene sets.
  6. Constructing and visualizing a protein-protein interaction network to explore the interactions between key genes.

2 Data Simulation

We begin by simulating a bulk RNA-seq dataset to illustrate the package’s capabilities. The simulate_omics_data function creates a dataset with 500 genes and 100 samples, balanced across phenotype (Control/Disease) and sex (male/female).

The four categorized groups are: - Male-specific: Genes up-regulated in the disease state only in males. - Female-specific: Genes up-regulated in the disease state only in females. - Sex-dimorphic: Genes regulated in opposite directions between males and females in the disease state. - Sex-neutral: Genes regulated with same directions between males and females in the disease state.

library(XYomics)
library(dplyr)

# Load precomputed bulk dataset
expression_data <- readRDS(
  system.file("extdata", "bulk_expression.rds", package = "XYomics")
)

sex <- readRDS(
  system.file("extdata", "bulk_sex.rds", package = "XYomics")
)

phenotype <- readRDS(
  system.file("extdata", "bulk_phenotype.rds", package = "XYomics")
)

3 Differential Expression Analysis Strategies

When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.

Sex-stratified analyses use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. These may include classical parametric hypothesis tests, such as Welch’s test for normally distributed data, or non-parametric tests, such as the Mann–Whitney U test, as well as special moderated statistics for high-dimensional omics data analysis, such as the empirical Bayes moderated t-statistic. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.

Interaction analysis formally tests whether the relationship between disease and molecular changes differs significantly between males and females. Not only can such interaction terms reveal complexities in disease mechanisms that might otherwise be obscured in analyses that do not consider SABV, but they also have the potential to detect changes that are limited to the magnitude of an effect, an aspect that sex-stratified analyses do not capture. Nevertheless, robust estimation of interaction effects requires large sample sizes, which are often not available due to the costs associated with advanced molecular profiling techniques such as single-cell RNA sequencing.

The XYomics package provides functions for both types of analyses.

3.1 Method 1: Sex-Stratified Analysis

Here, we use the sex_stratified_analysis_bulk() function to perform a stratified analysis.

res <- sex_stratified_analysis_bulk(expression_data, sex, phenotype)
# The `min_samples` parameter defines the minimum number of cells required
# to perform the interaction analysis. The default value is 3.
res <- readRDS(
  system.file("extdata", "bulk_results_degs.rds", package = "XYomics")
)
# Internal QC results from XYomics can be accessed through:
# res$validation
#
# This includes validation of sex and phenotype balance, group sample sizes,
# detection of design imbalances, and identification of groups below the
# minimum sample threshold required for reliable downstream analysis.
res$validation
## $valid
## [1] TRUE
## 
## $sex_counts
## sex_vec
##  F  M 
## 10 10 
## 
## $sex_ratio
## sex_vec
##   F   M 
## 0.5 0.5 
## 
## $imbalance
##     M 
## FALSE 
## 
## $ratio_difference
## M 
## 0 
## 
## $group_counts
## group
## WT.F TG.F WT.M TG.M 
##    5    5    5    5 
## 
## $low_sample_groups
## named integer(0)

4 Identification of Sex-Specific Genes (from Stratified Analysis)

The categorize_sex() function categorizes genes based on the results of the stratified analysis. This provides a useful, albeit potentially less robust, classification.

res_cat <- categorize_sex(res$male_DEGs, res$female_DEGs)
res_cat <- readRDS(
  system.file("extdata", "bulk_results_cat.rds", package = "XYomics")
)

cat("Top sex-stratified differentially expressed genes:\n")
## Top sex-stratified differentially expressed genes:
head(res_cat)
##                     DEG_Type Gene_Symbols Male_avg_logFC     Male_FDR
## male-specific1 male-specific      HNF4GP1     -2.3385776 7.877895e-20
## male-specific2 male-specific       CAVIN4      1.9039227 9.522511e-16
## male-specific3 male-specific       EPS8L2     -1.3881366 3.773010e-10
## male-specific4 male-specific        NFASC      1.0298699 5.079206e-07
## male-specific5 male-specific    AFAP1-AS2      0.9356591 4.506606e-06
## male-specific6 male-specific        WNT11      0.8978827 1.308807e-05
##                Female_avg_logFC Female_FDR
## male-specific1       0.12374127  0.5881039
## male-specific2      -0.02654086  0.9265521
## male-specific3       0.07858904  0.7320063
## male-specific4       0.09590189  0.6691739
## male-specific5      -0.11079054  0.6175147
## male-specific6       0.04508701  0.8562109
table(res_cat$DEG_Type)
## 
## female-specific   male-specific   sex-dimorphic     sex-neutral 
##               3              11               3              51

4.1 Method 2: Interaction Term Analysis

We use the sex_interaction_analysis_bulk() function to perform a formal interaction analysis.

interaction_term_results_bulk <- sex_interaction_analysis_bulk(expression_data, phenotype, sex)
# The `min_samples` parameter defines the minimum number of cells required
# to perform the interaction analysis. The default value ofr interaction analysis is 20.
interaction_term_results_bulk <- readRDS(
  system.file("extdata", "bulk_interaction_results.rds", package = "XYomics")
)

head(interaction_term_results_bulk$summary_stats)# signficant sex-modulated DEGs: interaction_term_results_bulk$sig_results
##   n_total_genes n_sig_genes voom_used
## 1           100          57     FALSE

5 Visualization of Expression Patterns

We use generate_violinplot_bulk() to visualize the expression of top genes identified by the stratified analysis.

top_male_gene <- res_cat %>%
  filter(DEG_Type == "male-specific") %>%
  arrange(Male_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)

top_female_gene <- res_cat %>%
  filter(DEG_Type == "female-specific") %>%
  arrange(Female_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)
top_dimorphic_gene <- res_cat %>%
  filter(DEG_Type == "sex-dimorphic") %>%
  arrange(Female_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)
top_neutral_gene <- res_cat %>%
  filter(DEG_Type == "sex-neutral") %>%
  arrange(Female_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)

top_genes = c(top_male_gene, top_female_gene, top_dimorphic_gene, top_neutral_gene )

generate_violinplot_bulk(expression_data,sex,phenotype , top_genes)

6 Pathway Enrichment Analysis

We perform pathway enrichment on the gene lists derived from the stratified analysis using categorized_enrich().

pathway_results <- categorized_enrich(res_cat, enrichment_db = "GO")

# The `enrichment_db` parameter can be set to "GO", "KEGG", "REACTOME".
# If the user wants to use a custom database, a TERM2GENE data frame must be
# provided via the `custom_db` parameter.
#
# Gene identifiers can be either any Gene identifier type supported by OrgDb
#' (e.g., "SYMBOL", "ENTREZID", "ENSEMBL", "UNIPROT")
# using the `gene_type` parameter.
#
# The `return_df` parameter controls the output format:
# if set to TRUE, results are returned as data frames;
# by default (FALSE), enrichResult objects are returned
pathway_results <- readRDS(
  system.file("extdata", "bulk_pathway_results.rds", package = "XYomics")
)

head(pathway_results)
## $`male-specific`
## #
## # over-representation test
## #
## #...@organism     Homo sapiens 
## #...@ontology     BP 
## #...@keytype      SYMBOL 
## #...@gene     chr [1:11] "HNF4GP1" "CAVIN4" "EPS8L2" "NFASC" "AFAP1-AS2" "WNT11" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...99 enriched terms found
## 'data.frame':    99 obs. of  12 variables:
##  $ ID            : chr  "GO:0090269" "GO:0090270" "GO:0003139" "GO:0048340" ...
##  $ Description   : chr  "fibroblast growth factor production" "regulation of fibroblast growth factor production" "secondary heart field specification" "paraxial mesoderm morphogenesis" ...
##  $ GeneRatio     : chr  "1/4" "1/4" "1/4" "1/4" ...
##  $ BgRatio       : chr  "10/18860" "10/18860" "11/18860" "11/18860" ...
##  $ RichFactor    : num  0.1 0.1 0.0909 0.0909 0.0714 ...
##  $ FoldEnrichment: num  472 472 429 429 337 ...
##  $ zScore        : num  21.7 21.7 20.7 20.7 18.3 ...
##  $ pvalue        : num  0.00212 0.00212 0.00233 0.00233 0.00297 ...
##  $ p.adjust      : num  0.0411 0.0411 0.0411 0.0411 0.0411 ...
##  $ qvalue        : num  0.00871 0.00871 0.00871 0.00871 0.00871 ...
##  $ geneID        : chr  "WNT11" "WNT11" "WNT11" "WNT11" ...
##  $ Count         : int  1 1 1 1 1 1 1 1 1 1 ...
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320 
## 
## 
## $`female-specific`
## #
## # over-representation test
## #
## #...@organism     Homo sapiens 
## #...@ontology     BP 
## #...@keytype      SYMBOL 
## #...@gene     chr [1:3] "HCG4" "MTRNR2L9" "TBC1D3B"
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...3 enriched terms found
## 'data.frame':    3 obs. of  12 variables:
##  $ ID            : chr  "GO:1900118" "GO:1900117" "GO:0097194"
##  $ Description   : chr  "negative regulation of execution phase of apoptosis" "regulation of execution phase of apoptosis" "execution phase of apoptosis"
##  $ GeneRatio     : chr  "1/1" "1/1" "1/1"
##  $ BgRatio       : chr  "21/18860" "40/18860" "96/18860"
##  $ RichFactor    : num  0.0476 0.025 0.0104
##  $ FoldEnrichment: num  898 472 196
##  $ zScore        : num  30 21.7 14
##  $ pvalue        : num  0.00111 0.00212 0.00509
##  $ p.adjust      : num  0.00318 0.00318 0.00509
##  $ qvalue        : logi  NA NA NA
##  $ geneID        : chr  "MTRNR2L9" "MTRNR2L9" "MTRNR2L9"
##  $ Count         : int  1 1 1
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320 
## 
## 
## $`sex-neutral`
## #
## # over-representation test
## #
## #...@organism     Homo sapiens 
## #...@ontology     BP 
## #...@keytype      SYMBOL 
## #...@gene     chr [1:51] "KLHL35" "TRIM17" "RPS26P44" "PP12613" "SETP12" "METTL16" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...0 enriched terms found
## #...Citation
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
# visualize via dotplot
plot <- plot_enrichment_dotplots(pathway_results)
print(plot)

$all

7 Protein-protein Interaction Network Analysis

Finally, we construct a protein-protein interaction network using the results from the stratified analysis.

7.0.1 1. Fetching the STRING Network

First, we download a protein-protein interaction network from the STRING database using get_string_network() or directly from the package.

# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)

# Load a pre-existing network from a file
g <- readRDS(system.file("extdata", "string_example_network.rds", package = "XYomics")) # You can also download the hormonal network "hormonal_network_metacore.rds" instead of "string_example_network.rds".

7.0.2 2. Constructing the PCSF Network

Next, we define “prizes” for our genes based on their statistical significance (e.g., -log10 of the p-value) and use the construct_ppi_pcsf function to build a context-specific network.

# Use dimorphic DE results to define prizes
dimorphic_specific <- res_cat[res_cat$DEG_Type == "sex-dimorphic", ]
dimorphic_prizes <- -log10((dimorphic_specific$Male_FDR + dimorphic_specific$Female_FDR) / 2)
names(dimorphic_prizes) <-dimorphic_specific$Gene_Symbols

# Construct the PCSF subnetwork

dimorphic_network <- construct_ppi_pcsf(g = g, prizes = dimorphic_prizes)
network <- ppi_pipeline(res_cat, g)
network <- readRDS(
  system.file("extdata", "bulk_network_results.rds", package = "XYomics")
)

neutral_network <- readRDS(
  system.file("extdata", "bulk_neutral_network.rds", package = "XYomics")
)

neutral_specific <- res_cat[res_cat$DEG_Type == "sex-neutral", ]

7.0.3 3. Visualizing the Network

Finally, we use the plot_network() function to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.

plot_network(neutral_network, "sex-neutral", neutral_specific, show_barplot = T)

#Generate plots for all categories

all_plots <- plot_network_pipeline(network, res_cat)
## plot_network: Graph contains no nodes.

7.0.4 Generating a Report

The generate_cat_report function can be used to compile all results into a single HTML report.

# This command generates a comprehensive HTML report 

template_path = system.file("extdata", "Template_report_bulk.Rmd", package = "XYomics") # fetch the template for bulk from the package

generate_cat_report(res_cat, pathway_results, network, template_path = template_path)

8 Conclusion

This vignette has demonstrated two key approaches for analyzing sex-specific effects in bulk RNA-seq data. While sex-stratified analysis is a common first step, interaction analysis provides a more statistically robust method for identifying genes whose regulation by disease is truly dependent on sex. We recommend using an interaction-based approach when possible, while being mindful of the need for adequate sample sizes.