---
title: "Large-Scale Linear Regression with lsReg"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Large-Scale Linear Regression with lsReg}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

`lsReg` is designed for settings where a large number of candidate covariates must each be tested for association with an outcome, given a fixed set of baseline covariates. Rather than fitting a separate model for each candidate, `lsReg` pre-computes and caches expensive quantities from the base model once, then reuses them for every subsequent test. This is the pattern used in genome-wide association studies (GWAS), where thousands or millions of genetic variants are tested one at a time against a common base model.

The workflow has two steps:

1. **Allocate** — fit the base model with `glm()`, then call `lsReg()` to cache the quantities needed for repeated testing.
2. **Test** — call `addcovar()` once per candidate covariate, retrieving the parameter estimate and test statistic from the allocated memory object.

## Example data

We use a simulated dataset with 1,000 observations. The outcome `ylin` was generated as a linear function of `x1`, `x2`, `z5`, and `z9`. The variables `z1` through `z10` are candidate covariates to be screened.

```{r load-data}
library(lsReg)
datafile <- system.file("extdata", "simulated_data.rds", package = "lsReg")
dat <- readRDS(datafile)
head(dat[, c("ylin", "x1", "x2", "z1", "z2", "z5", "z9")])
```

## Step 1: Fit the base model

The base model includes only the baseline covariates `x1` and `x2`.

```{r base-model}
basemdl <- glm(ylin ~ x1 + x2, data = dat, family = gaussian)
summary(basemdl)
```

## Step 2: Allocate memory

Pass the fitted base model to `lsReg()`, specifying that each test will add one column (`colstoadd = 1`) and that the Wald test statistic should be used.
```{r allocate}
mem <- lsReg(basemdl, colstoadd = 1, testtype = "wald")
```

## Step 3: Test each candidate covariate

Loop over `z1` through `z10`, calling `addcovar()` for each. After each call, the parameter estimate for the candidate covariate is stored in `mem$fitdata$betab` and the Wald test statistic in `mem$testvalue`.

```{r run-tests}
zvars <- paste0("z", 1:10)
results <- data.frame(
  variable = zvars,
  estimate = NA_real_,
  statistic = NA_real_
)

for (i in seq_along(zvars)) {
  xr <- as.matrix(dat[, zvars[i], drop = FALSE])
  addcovar(mem, xr)
  results$estimate[i] <- mem$fitdata$betab[1]
  results$statistic[i] <- mem$testvalue[1, 1]
}
```

## Results

```{r results}
results$pvalue <- 2 * pnorm(-abs(results$statistic))
print(results, digits = 4)
```

The p-values treat the Wald statistic as standard normal under the null, which is an accurate approximation at this sample size. The variables `z5` and `z9` have the largest test statistics and the smallest p-values, consistent with the data-generating model. The remaining variables are null and show statistics close to zero.

## Verification against standard GLM

To confirm that `lsReg` produces the same estimates and test statistics as a full `glm()` fit, we fit the models `ylin ~ x1 + x2 + z5` and `ylin ~ x1 + x2 + z9` directly and compare.
```{r verify}
glm_z5 <- glm(ylin ~ x1 + x2 + z5, data = dat, family = gaussian)
glm_z9 <- glm(ylin ~ x1 + x2 + z9, data = dat, family = gaussian)

glm_est_z5  <- coef(glm_z5)["z5"]
glm_stat_z5 <- coef(summary(glm_z5))["z5", "t value"]
glm_est_z9  <- coef(glm_z9)["z9"]
glm_stat_z9 <- coef(summary(glm_z9))["z9", "t value"]

lsreg_est_z5  <- results$estimate[results$variable == "z5"]
lsreg_stat_z5 <- results$statistic[results$variable == "z5"]
lsreg_est_z9  <- results$estimate[results$variable == "z9"]
lsreg_stat_z9 <- results$statistic[results$variable == "z9"]

comparison <- data.frame(
  variable = c("z5", "z9"),
  glm_estimate = c(glm_est_z5, glm_est_z9),
  lsreg_estimate = c(lsreg_est_z5, lsreg_est_z9),
  glm_statistic = c(glm_stat_z5, glm_stat_z9),
  lsreg_statistic = c(lsreg_stat_z5, lsreg_stat_z9)
)
print(comparison, digits = 6)
```

The estimates and test statistics from `lsReg` match those from `glm()` to numerical precision.
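## Why one base fit suffices

The caching strategy described in the Overview works because the coefficient of a covariate added to a fixed base model can be computed from the base model's residuals alone. This is the Frisch–Waugh–Lovell theorem: residualize both the outcome and the candidate on the base covariates, then a single simple regression of residual on residual recovers the full-model coefficient. The sketch below illustrates this in base R on freshly simulated data (the variable names here are illustrative and are not tied to the package's internals, which may use a different but equivalent computation).

```{r fwl-sketch, eval = FALSE}
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); z <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + 0.8 * z + rnorm(n)

# Full fit: coefficient of z when added to the base covariates.
full <- lm(y ~ x1 + x2 + z)

# FWL shortcut: residualize y and z on the base covariates once,
# then a no-intercept simple regression reproduces the same coefficient.
ry  <- resid(lm(y ~ x1 + x2))
rz  <- resid(lm(z ~ x1 + x2))
fwl <- lm(ry ~ rz - 1)

all.equal(unname(coef(full)["z"]), unname(coef(fwl)["rz"]))  # TRUE
```

The base-model residualization depends only on `x1` and `x2`, so it can be done once and reused for every candidate, which is exactly the saving `lsReg` exploits across thousands of tests.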