---
title: "Large-Scale Linear Regression with lsReg"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Large-Scale Linear Regression with lsReg}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

`lsReg` is designed for settings where a large number of candidate covariates must each be tested for association with an outcome, given a fixed set of baseline covariates. Rather than fitting a separate model for each candidate, `lsReg` pre-computes and caches expensive quantities from the base model once, then reuses them for every subsequent test. This is the pattern used in genome-wide association studies (GWAS), where thousands or millions of genetic variants are tested one at a time against a common base model.

The workflow has two steps:

1. **Allocate** — fit the base model with `glm()`, then call `lsReg()` to cache the quantities needed for repeated testing.
2. **Test** — call `addcovar()` once per candidate covariate, retrieving the parameter estimate and test statistic from the allocated memory object.

## Example data

We use a simulated dataset with 1,000 observations. The outcome `ylin` was generated as a linear function of `x1`, `x2`, `z5`, and `z9`. The variables `z1` through `z10` are candidate covariates to be screened.

```{r load-data}
library(lsReg)
datafile <- system.file("extdata", "simulated_data.rds", package = "lsReg")
dat <- readRDS(datafile)
head(dat[, c("ylin", "x1", "x2", "z1", "z2", "z5", "z9")])
```

## Step 1: Fit the base model

The base model includes only the baseline covariates `x1` and `x2`.

```{r base-model}
basemdl <- glm(ylin ~ x1 + x2, data = dat, family = gaussian)
summary(basemdl)
```

## Step 2: Allocate memory

Pass the fitted base model to `lsReg()`, specifying that each test will add one column (`colstoadd = 1`) and that the Wald test statistic should be used.
```{r allocate}
mem <- lsReg(basemdl, colstoadd = 1, testtype = "wald")
```

## Step 3: Test each candidate covariate

Loop over `z1` through `z10`, calling `addcovar()` for each. After each call, the parameter estimate for the candidate covariate is stored in `mem$fitdata$betab` and the Wald test statistic in `mem$testvalue`.

```{r run-tests}
zvars <- paste0("z", 1:10)
results <- data.frame(
  variable = zvars,
  estimate = NA_real_,
  statistic = NA_real_
)

for (i in seq_along(zvars)) {
  xr <- as.matrix(dat[, zvars[i], drop = FALSE])
  addcovar(mem, xr)
  results$estimate[i] <- mem$fitdata$betab[1]
  results$statistic[i] <- mem$testvalue[1, 1]
}
```

## Results

```{r results}
results$pvalue <- 2 * pnorm(-abs(results$statistic))
print(results, digits = 4)
```

The p-values treat the Wald statistic as standard normal under the null, which is an accurate approximation at this sample size. The variables `z5` and `z9` have the largest test statistics and the smallest p-values, consistent with the data-generating model. The remaining variables are null and show statistics close to zero.

## Verification against standard GLM

To confirm that `lsReg` produces the same estimates and test statistics as a full `glm()` fit, we fit the models `ylin ~ x1 + x2 + z5` and `ylin ~ x1 + x2 + z9` directly and compare.
```{r verify}
glm_z5 <- glm(ylin ~ x1 + x2 + z5, data = dat, family = gaussian)
glm_z9 <- glm(ylin ~ x1 + x2 + z9, data = dat, family = gaussian)

glm_est_z5  <- coef(glm_z5)["z5"]
glm_stat_z5 <- coef(summary(glm_z5))["z5", "t value"]
glm_est_z9  <- coef(glm_z9)["z9"]
glm_stat_z9 <- coef(summary(glm_z9))["z9", "t value"]

lsreg_est_z5  <- results$estimate[results$variable == "z5"]
lsreg_stat_z5 <- results$statistic[results$variable == "z5"]
lsreg_est_z9  <- results$estimate[results$variable == "z9"]
lsreg_stat_z9 <- results$statistic[results$variable == "z9"]

comparison <- data.frame(
  variable = c("z5", "z9"),
  glm_estimate = c(glm_est_z5, glm_est_z9),
  lsreg_estimate = c(lsreg_est_z5, lsreg_est_z9),
  glm_statistic = c(glm_stat_z5, glm_stat_z9),
  lsreg_statistic = c(lsreg_stat_z5, lsreg_stat_z9)
)
print(comparison, digits = 6)
```

The estimates and test statistics from `lsReg` match those from `glm()` to numerical precision.
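## Why one base fit suffices

The caching strategy described in the Overview works because the coefficient of a covariate added to a fixed base model can be computed from the base model's residuals alone. This is the Frisch–Waugh–Lovell theorem: residualize both the outcome and the candidate on the base covariates, then a single simple regression of residual on residual recovers the full-model coefficient. The sketch below illustrates this in base R on freshly simulated data (the variable names here are illustrative and are not tied to the package's internals, which may use a different but equivalent computation).

```{r fwl-sketch, eval = FALSE}
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); z <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + 0.8 * z + rnorm(n)

# Full fit: coefficient of z when added to the base covariates.
full <- lm(y ~ x1 + x2 + z)

# FWL shortcut: residualize y and z on the base covariates once,
# then a no-intercept simple regression reproduces the same coefficient.
ry  <- resid(lm(y ~ x1 + x2))
rz  <- resid(lm(z ~ x1 + x2))
fwl <- lm(ry ~ rz - 1)

all.equal(unname(coef(full)["z"]), unname(coef(fwl)["rz"]))  # TRUE
```

The base-model residualization depends only on `x1` and `x2`, so it can be done once and reused for every candidate, which is exactly the saving `lsReg` exploits across thousands of tests.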