---
title: "Large-Scale Logistic Regression with lsReg"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Large-Scale Logistic Regression with lsReg}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

This vignette demonstrates large-scale logistic regression using `lsReg`. The workflow is identical to the linear case: fit a base model once, allocate memory, then efficiently test each candidate covariate without re-fitting the base model. Here we use the likelihood ratio test (LRT) statistic.

## Example data

We use the same simulated dataset as the linear regression vignette. The binary outcome `ylog` was generated from a logistic model with `x1`, `x2`, `z5`, and `z9` as predictors. The variables `z1` through `z10` are the candidates to be screened.

```{r load-data}
library(lsReg)

datafile <- system.file("extdata", "simulated_data.rds", package = "lsReg")
dat <- readRDS(datafile)
head(dat[, c("ylog", "x1", "x2", "z1", "z2", "z5", "z9")])
```

## Step 1: Fit the base model

```{r base-model}
basemdl <- glm(ylog ~ x1 + x2, data = dat, family = binomial)
summary(basemdl)
```

## Step 2: Allocate memory

```{r allocate}
mem <- lsReg(basemdl, colstoadd = 1, testtype = "lrt")
```

## Step 3: Test each candidate covariate

After each call to `addcovar()`, the parameter estimate is in `mem$fitdata$betab` and the LRT chi-square statistic is in `mem$testvalue`.

```{r run-tests}
zvars <- paste0("z", 1:10)
results <- data.frame(
  variable = zvars,
  estimate = NA_real_,
  statistic = NA_real_
)

for (i in seq_along(zvars)) {
  # Extract the candidate covariate as a one-column matrix
  xr <- as.matrix(dat[, zvars[i], drop = FALSE])
  addcovar(mem, xr)
  # Read off the estimate and LRT statistic for this candidate
  results$estimate[i] <- mem$fitdata$betab[1]
  results$statistic[i] <- mem$testvalue[1]
}
```

## Results

Under the null hypothesis, the LRT statistic follows a chi-square distribution with 1 degree of freedom.
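Written out explicitly, the statistic is twice the gain in maximized log-likelihood from adding the candidate covariate, or equivalently the drop in deviance, and it is asymptotically chi-square with one degree of freedom under the null:

$$
\mathrm{LRT} \;=\; 2\left(\ell_{\text{full}} - \ell_{\text{base}}\right) \;=\; D_{\text{base}} - D_{\text{full}} \;\sim\; \chi^2_{1} \quad \text{(asymptotically, under } H_0\text{)}
$$

Here $\ell_{\text{base}}$ and $\ell_{\text{full}}$ are the maximized log-likelihoods of the base and augmented models, and $D$ denotes the corresponding deviances.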
```{r results}
results$pvalue <- pchisq(results$statistic, df = 1, lower.tail = FALSE)
print(results, digits = 4)
```

The variables `z5` and `z9` have the largest test statistics and the smallest p-values, consistent with the data-generating model.

## Verification against standard GLM

We verify that `lsReg` matches a full `glm()` fit for `z5` and `z9`. The LRT statistic is the difference in deviance between the base model and the full model, which `anova()` reports directly.

```{r verify}
glm_z5 <- glm(ylog ~ x1 + x2 + z5, data = dat, family = binomial)
glm_z9 <- glm(ylog ~ x1 + x2 + z9, data = dat, family = binomial)

glm_est_z5 <- coef(glm_z5)["z5"]
glm_stat_z5 <- anova(basemdl, glm_z5, test = "LRT")$Deviance[2]
glm_est_z9 <- coef(glm_z9)["z9"]
glm_stat_z9 <- anova(basemdl, glm_z9, test = "LRT")$Deviance[2]

lsreg_est_z5 <- results$estimate[results$variable == "z5"]
lsreg_stat_z5 <- results$statistic[results$variable == "z5"]
lsreg_est_z9 <- results$estimate[results$variable == "z9"]
lsreg_stat_z9 <- results$statistic[results$variable == "z9"]

comparison <- data.frame(
  variable = c("z5", "z9"),
  glm_estimate = c(glm_est_z5, glm_est_z9),
  lsreg_estimate = c(lsreg_est_z5, lsreg_est_z9),
  glm_statistic = c(glm_stat_z5, glm_stat_z9),
  lsreg_statistic = c(lsreg_stat_z5, lsreg_stat_z9)
)
print(comparison, digits = 6)
```

The estimates and LRT statistics from `lsReg` match those from `glm()` to numerical precision.
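The agreement can also be asserted programmatically rather than read off the printed table. The sketch below reuses the `comparison` data frame built in the previous chunk; the tolerance is illustrative, not a value mandated by `lsReg`.

```{r check-agreement}
# Assert that glm() and lsReg agree on estimates and LRT statistics.
# Tolerance is chosen for illustration only.
stopifnot(
  isTRUE(all.equal(comparison$glm_estimate, comparison$lsreg_estimate,
                   tolerance = 1e-8)),
  isTRUE(all.equal(comparison$glm_statistic, comparison$lsreg_statistic,
                   tolerance = 1e-8))
)
```

If any pair of values disagrees beyond the tolerance, `stopifnot()` raises an error, which fails the vignette build and flags the regression immediately.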