---
title: "Large-Scale Logistic Regression with lsReg"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Large-Scale Logistic Regression with lsReg}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

This vignette demonstrates large-scale logistic regression using `lsReg`. The workflow is identical to the linear case: fit a base model once, allocate memory, then efficiently test each candidate covariate without re-fitting the base model. Here we use the likelihood ratio test (LRT) statistic.

## Example data

We use the same simulated dataset as the linear regression vignette. The binary outcome `ylog` was generated from a logistic model with `x1`, `x2`, `z5`, and `z9` as predictors. The variables `z1` through `z10` are the candidates to be screened.

```{r load-data}
library(lsReg)

datafile <- system.file("extdata", "simulated_data.rds", package = "lsReg")
dat <- readRDS(datafile)
head(dat[, c("ylog", "x1", "x2", "z1", "z2", "z5", "z9")])
```

## Step 1: Fit the base model

```{r base-model}
basemdl <- glm(ylog ~ x1 + x2, data = dat, family = binomial)
summary(basemdl)
```

## Step 2: Allocate memory

```{r allocate}
mem <- lsReg(basemdl, colstoadd = 1, testtype = "lrt")
```

## Step 3: Test each candidate covariate

After each call to `addcovar()`, the parameter estimate is in `mem$fitdata$betab` and the LRT chi-square statistic is in `mem$testvalue`.

```{r run-tests}
zvars <- paste0("z", 1:10)
results <- data.frame(
  variable = zvars,
  estimate = NA_real_,
  statistic = NA_real_
)

for (i in seq_along(zvars)) {
  # Extract the candidate covariate as a one-column matrix
  xr <- as.matrix(dat[, zvars[i], drop = FALSE])
  addcovar(mem, xr)
  # Read off the estimate and LRT statistic for this candidate
  results$estimate[i] <- mem$fitdata$betab[1]
  results$statistic[i] <- mem$testvalue[1]
}
```

## Results

Under the null hypothesis, the LRT statistic follows a chi-square distribution with 1 degree of freedom.
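Written out explicitly, the statistic is twice the gain in maximized log-likelihood from adding the candidate covariate, or equivalently the drop in deviance, and it is asymptotically chi-square with one degree of freedom under the null:

$$
\mathrm{LRT} \;=\; 2\left(\ell_{\text{full}} - \ell_{\text{base}}\right) \;=\; D_{\text{base}} - D_{\text{full}} \;\sim\; \chi^2_{1} \quad \text{(asymptotically, under } H_0\text{)}
$$

Here $\ell_{\text{base}}$ and $\ell_{\text{full}}$ are the maximized log-likelihoods of the base and augmented models, and $D$ denotes the corresponding deviances.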
```{r results}
results$pvalue <- pchisq(results$statistic, df = 1, lower.tail = FALSE)
print(results, digits = 4)
```

The variables `z5` and `z9` have the largest test statistics and the smallest p-values, consistent with the data-generating model.

## Verification against standard GLM

We verify that `lsReg` matches a full `glm()` fit for `z5` and `z9`. The LRT statistic is the difference in deviance between the base model and the full model, which `anova()` reports directly.

```{r verify}
glm_z5 <- glm(ylog ~ x1 + x2 + z5, data = dat, family = binomial)
glm_z9 <- glm(ylog ~ x1 + x2 + z9, data = dat, family = binomial)

glm_est_z5 <- coef(glm_z5)["z5"]
glm_stat_z5 <- anova(basemdl, glm_z5, test = "LRT")$Deviance[2]
glm_est_z9 <- coef(glm_z9)["z9"]
glm_stat_z9 <- anova(basemdl, glm_z9, test = "LRT")$Deviance[2]

lsreg_est_z5 <- results$estimate[results$variable == "z5"]
lsreg_stat_z5 <- results$statistic[results$variable == "z5"]
lsreg_est_z9 <- results$estimate[results$variable == "z9"]
lsreg_stat_z9 <- results$statistic[results$variable == "z9"]

comparison <- data.frame(
  variable = c("z5", "z9"),
  glm_estimate = c(glm_est_z5, glm_est_z9),
  lsreg_estimate = c(lsreg_est_z5, lsreg_est_z9),
  glm_statistic = c(glm_stat_z5, glm_stat_z9),
  lsreg_statistic = c(lsreg_stat_z5, lsreg_stat_z9)
)
print(comparison, digits = 6)
```

The estimates and LRT statistics from `lsReg` match those from `glm()` to numerical precision.
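The agreement can also be asserted programmatically rather than read off the printed table. The sketch below reuses the `comparison` data frame built in the previous chunk; the tolerance is illustrative, not a value mandated by `lsReg`.

```{r check-agreement}
# Assert that glm() and lsReg agree on estimates and LRT statistics.
# Tolerance is chosen for illustration only.
stopifnot(
  isTRUE(all.equal(comparison$glm_estimate, comparison$lsreg_estimate,
                   tolerance = 1e-8)),
  isTRUE(all.equal(comparison$glm_statistic, comparison$lsreg_statistic,
                   tolerance = 1e-8))
)
```

If any pair of values disagrees beyond the tolerance, `stopifnot()` raises an error, which fails the vignette build and flags the regression immediately.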