--- title: "Introduction to ReproStat" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to ReproStat} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) set.seed(20260307) library(ReproStat) ``` ## Overview **ReproStat** provides tools for diagnosing the reproducibility of statistical modeling workflows. The central idea is to perturb the data in small ways (bootstrap, subsampling, or noise) and measure how much model outputs change across perturbations. Stable outputs imply high reproducibility; volatile outputs flag potential concerns. The package computes: - **Coefficient stability** - variance of estimates across perturbations. - **P-value stability** - frequency of significant findings. - **Selection stability** - frequency of variable selection behavior. - **Prediction stability** - variance of predictions. - **Reproducibility Index (RI)** - a composite 0-100 score. ## Basic workflow ### Step 1: Run diagnostics ```{r run-diag} mt_diag <- run_diagnostics( mpg ~ wt + hp + disp, data = mtcars, B = 200, method = "bootstrap" ) mt_diag ``` ### Step 2: Inspect individual metrics ```{r metrics} coef_stability(mt_diag) pvalue_stability(mt_diag) selection_stability(mt_diag) prediction_stability(mt_diag)$mean_variance ``` ### Step 3: Compute the reproducibility index ```{r ri} reproducibility_index(mt_diag) ``` ### Step 4: Visualize ```{r plots} oldpar <- par(mfrow = c(1, 2)) plot_stability(mt_diag, "coefficient") plot_stability(mt_diag, "selection") par(oldpar) ``` ## Perturbation methods Three perturbation strategies are supported: ```{r methods, eval = FALSE} # Bootstrap resampling (default) run_diagnostics(mpg ~ wt + hp, mtcars, method = "bootstrap") # Subsampling (80 % of rows without replacement) run_diagnostics(mpg ~ wt + hp, mtcars, method = "subsample", frac = 0.8) # Gaussian noise injection (5 % of each column's SD) run_diagnostics(mpg ~ wt + hp, mtcars, method = "noise", noise_sd = 0.05) ``` ## Model comparison with CV ranking stability Use `cv_ranking_stability()` to compare multiple candidate models across repeated cross-validation runs. ```{r cv} models <- list( baseline = mpg ~ wt + hp + disp, compact = mpg ~ wt + hp, expanded = mpg ~ wt + hp + disp + qsec ) cv_result <- cv_ranking_stability(models, mtcars, v = 5, R = 40) cv_result$summary ``` ```{r cv-plot} plot_cv_stability(cv_result, metric = "top1_frequency") ``` ## Working with other datasets The package is dataset-agnostic. Any data frame with a numeric response and numeric predictors works: ```{r iris-example} iris_diag <- run_diagnostics( Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris, B = 150, method = "noise", noise_sd = 0.03 ) reproducibility_index(iris_diag) ``` For datasets with missing values, subset to complete cases first: ```{r airquality-example} aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]) aq_diag <- run_diagnostics( Ozone ~ Solar.R + Wind + Temp, data = aq, B = 150, method = "bootstrap" ) reproducibility_index(aq_diag) ``` ## Uncertainty in the RI `ri_confidence_interval()` resamples the stored perturbation draws to estimate uncertainty in the RI without refitting any models. ```{r ci} set.seed(20260307) d <- run_diagnostics(mpg ~ wt + hp + disp, mtcars, B = 150) ri_confidence_interval(d, level = 0.95, R = 500) ``` ## Modeling backends ReproStat supports four backends. All use the same API; only the `backend` argument changes. ```{r backends, eval = FALSE} # Generalized linear model (logistic) run_diagnostics( am ~ wt + hp, mtcars, B = 100, family = stats::binomial() ) # Robust regression (requires MASS) if (requireNamespace("MASS", quietly = TRUE)) { run_diagnostics(mpg ~ wt + hp, mtcars, B = 100, backend = "rlm") } # LASSO (requires glmnet) if (requireNamespace("glmnet", quietly = TRUE)) { run_diagnostics( mpg ~ wt + hp + disp + qsec, mtcars, B = 100, backend = "glmnet", en_alpha = 1 ) } ``` ## ggplot2 helpers If **ggplot2** is installed, `plot_stability_gg()` and `plot_cv_stability_gg()` return `ggplot` objects that can be further customized. ```{r ggplot-helpers, eval = requireNamespace("ggplot2", quietly = TRUE)} if (requireNamespace("ggplot2", quietly = TRUE)) { set.seed(1) d_gg <- run_diagnostics(mpg ~ wt + hp + disp, mtcars, B = 50) print(plot_stability_gg(d_gg, "coefficient")) print(plot_stability_gg(d_gg, "selection")) } ``` ## Where to go next After this introduction, the most useful follow-up articles are: - **Interpreting ReproStat Outputs** for guidance on reading the metrics and the RI in applied work - **Backend Guide** for choosing among `lm`, `glm`, `rlm`, and `glmnet` - **Workflow Patterns** for practical analysis templates