The Survey of Consumer Finances (SCF) is a triennial survey of U.S. household finances conducted by the Federal Reserve Board. It is among the most detailed data sources on U.S. household wealth, income, and financial behavior.
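Valid estimation from the SCF requires handling two design features: multiply-imputed data (five implicates per household) and a complex sample design represented by replicate weights.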
The scf package handles both. It wraps the five
implicates in replicate-weighted survey::svrepdesign()
objects, applies estimation routines across implicates, and pools
results using Rubin’s Rules or equivalent procedures.
This vignette demonstrates the core workflow. For methodological background, see the package paper.
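As a rough illustration of what pooling with Rubin's Rules involves, the base-R sketch below combines five implicate-level estimates and their sampling variances into a single estimate and standard error. The helper pool_rubin() is hypothetical and not part of the scf package, whose own pooling may differ in detail (for example in how degrees of freedom are handled). The input values are copied from the richTRUE rows of the scf_implicates() output shown near the end of this vignette.

```r
# Hypothetical helper (not part of scf): pool M implicate-level estimates
# and their sampling variances using Rubin's Rules.
pool_rubin <- function(est, var) {
  m <- length(est)
  q_bar <- mean(est)        # pooled point estimate
  u_bar <- mean(var)        # average within-implicate variance
  b <- stats::var(est)      # between-implicate variance
  total <- u_bar + (1 + 1 / m) * b
  c(estimate = q_bar, se = sqrt(total))
}

# Implicate-level proportions and variances for rich == TRUE,
# copied from the scf_implicates() output in this vignette.
est <- c(0.1269190, 0.1468078, 0.1274161, 0.1205673, 0.1172094)
v <- c(0.0004432830, 0.0005488627, 0.0004395554, 0.0003846508, 0.0004057971)

pooled <- pool_rubin(est, v)
print(round(pooled, 4))
```

In this example the pooled standard error is larger than any single implicate's standard error, because the between-implicate term adds imputation uncertainty on top of sampling variance.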
Download SCF data and load it into a multiply-imputed survey object
using scf_download() and scf_load().
The result is an scf_mi_survey object containing five
svyrep.design objects, one per implicate, and survey-year
metadata.
scf_update() adds or modifies variables uniformly across
all five implicates. Bottom-code skewed variables before taking logs to
avoid the log of zero or negative values.
scf2022 <- scf_update(scf2022,
  senior = age >= 65,
  female = factor(hhsex, levels = 1:2, labels = c("Male", "Female")),
  rich = networth > 1e6,
  networth = ifelse(networth > 1, networth, 1),
  log_networth = log(networth),
  income = ifelse(income > 1, income, 1),
  log_income = log(income),
  npeople = x101
)
Use names(scf2022$mi_design[[1]]$variables) to inspect
available variables.
When a transformation depends on the distribution within each
implicate — ranks, percentile flags, groupwise z-scores — use
scf_update_by_implicate() instead:
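The reason for a per-implicate update can be sketched in base R without the package. The example below assumes each implicate is a plain data frame; the list `implicates`, the helper flag_top20(), and all values are hypothetical, and the percentile is unweighted for simplicity, whereas a real SCF analysis would use the survey weights. It does not show the signature of scf_update_by_implicate() itself.

```r
# Hypothetical stand-in for the implicates: two plain data frames with
# made-up values (the real objects are survey designs, not data frames).
implicates <- list(
  data.frame(networth = c(10, 20, 30, 40, 1000)),
  data.frame(networth = c(12, 18, 33, 41, 800))
)

# Hypothetical helper: flag households at or above the implicate's own
# unweighted 80th percentile of net worth.
flag_top20 <- function(df) {
  cutoff <- stats::quantile(df$networth, probs = 0.8, names = FALSE)
  df$top20 <- df$networth >= cutoff
  df
}

# Apply the transformation separately to each implicate.
updated <- lapply(implicates, flag_top20)

# The cutoff differs by implicate, so it cannot be one shared constant.
cutoffs <- vapply(
  implicates,
  function(df) stats::quantile(df$networth, probs = 0.8, names = FALSE),
  numeric(1)
)
print(cutoffs)
```

Because the cutoff is recomputed inside each implicate, a single expression evaluated once and copied to all implicates would not reproduce this result.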
scf_mean(), scf_median(), and
scf_percentile() return pooled population estimates with
standard errors. The by argument produces group-level
estimates.
scf_mean(scf2022, ~networth, by = ~senior)
#> Multiply-Imputed, Replicate-Weighted Mean Estimate
#>
#> group variable estimate se min max
#> FALSE networth 809777.2 135477.9 771184 821626.9
#> TRUE networth 1595858.8 483053.9 1355046 1812726.0
scf_median(scf2022, ~income, by = ~female)
#> Multiply-Imputed Median Estimate
#>
#> group variable quantile estimate se min max
#> Male income 0.5 85824.40 6488.425 85392.03 86472.95
#> Female income 0.5 49721.94 5304.311 49721.94 49721.94
scf_percentile(scf2022, ~networth, q = 0.9)
#> SCF Percentile Estimate
#>
#> SCF Percentile Estimate (SCF Bulletin convention)
#>
#> variable quantile estimate se min max
#> networth 0.9 1197722 393453.4 1039600 1360000
scf_percentile(scf2022, ~networth, q = 0.75, by = ~female)
#> SCF Percentile Estimate
#>
#> SCF Percentile Estimate (SCF Bulletin convention)
#>
#> group variable quantile estimate se min max
#> Male networth 0.75 642660 192745.73 623800 698700
#> Female networth 0.75 238504 56917.53 237680 241800
Discrete distributions are summarized with scf_freq()
and cross-tabulated with scf_xtab().
scf_freq(scf2022, ~senior)
#> SCF Frequency Table (Pooled Results)
#>
#> group category proportion se_proportion
#> NA FALSE 75.77111 3.135819
#> NA TRUE 24.22889 3.135819
scf_xtab(scf2022, ~senior, ~female, scale = "col")
#> SCF Cross-Tabulation
#> Row Variable: senior | Column Variable: female
#> Displayed as: col proportions (percent)
#>
#> female: Female female: Male
#> senior: FALSE 78.51 70.60
#> senior: TRUE 21.49 29.40
scf_ttest() and scf_prop_test() produce
pooled inferential tests with correct degrees of freedom.
scf_ttest(scf2022, ~networth, mu = 250000)
#> SCF One-sample t-test
#> Alternative hypothesis: mean is not equal to 250000
#>
#> Estimate: 1000482.40
#> Standard Error: 153600.47
#> t = 4.89, df = 744.0, p = 0.0000 ***
#> CI (95%): [698940.49, 1302024.31]
scf_ttest(scf2022, ~networth, group = ~senior)
#> SCF Two-sample t-test
#> Alternative hypothesis: mean is not equal to 0
#>
#> Group means:
#> group mean
#> FALSE 809777.2
#> TRUE 1595858.8
#>
#> Estimate: -786081.60
#> Standard Error: 507708.12
#> t = -1.55, df = 114.2, p = 0.1243
#> CI (95%): [-1791827.18, 219663.98]
scf_prop_test(scf2022, ~senior, p = 0.25)
#>
#> One-sample proportion test
#> Null hypothesis: proportion = 0.25
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#>
#> estimate std.error z.value p.value conf.low conf.high stars
#> 0.2423 0.0022 -3.5822 3e-04 0.2381 0.2465 ***
scf_prop_test(scf2022, ~rich, ~female)
#>
#> Two-sample proportion test
#> Null hypothesis: proportion difference = 0.5
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#>
#> estimate std.error z.value p.value conf.low conf.high stars
#> 0.1626 0.0253 -13.3593 0 0.1131 0.2121 ***
#>
#> Estimated group proportions:
#> group proportion
#> Male 0.1842
#> Female 0.0216
scf_ols(), scf_glm(), and
scf_logit() fit replicate-weighted models to each implicate
and pool coefficients via Rubin’s Rules.
scf_ols(scf2022, log_networth ~ age + log_income)
#> OLS Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#> term estimate std.error t.value p.value stars
#> (Intercept) -16.34472 2.71954 -6.010 3.185e-09 ***
#> age 0.08801 0.01307 6.732 2.218e-11 ***
#> log_income 2.03060 0.22425 9.055 6.056e-18 ***
#>
#> Model Fit Statistics:
#> Mean R-squared: 0.4035 (SD: 0.011)
#> Mean AIC: 342.0288 (SD: 498.395)
#>
#> Note: Implicate-level model objects are stored in `object$imps`
#> Use `summary(object$imps[[1]])` to inspect them.
scf_logit(scf2022, rich ~ age + log_income, odds = TRUE)
#> Logistic Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#> term estimate std.error t.value p.value stars
#> (Intercept) 0.0000 0.0000 -4.0575 <2e-16 ***
#> age 1.0964 0.0409 2.4634 0.0138 *
#> log_income 14.1789 9.3306 4.0297 0.0001 ***
#>
#> Model Fit Diagnostics:
#> Pseudo R-squared: 0.5069
#> Mean AIC: NA
#>
#> Notes:
#> - Estimates are reported on the Odds Ratio scale.
#> - Implicate-level models are stored in `object$imps`
scf_glm(scf2022, own ~ age, family = binomial())
#> Generalized Linear Model (Multiply-Imputed SCF)
#> --------------------------------------------------
#> term estimate std.error z.value p.value stars
#> (Intercept) 1.4508 0.6710 2.1621 0.03061 *
#> age 0.0070 0.0127 0.5494 0.58270
#>
#> Model Fit Diagnostics:
#> Pseudo R-squared: 0.002 (SD: 0.000)
#> Mean AIC: NA (SD: NA)
#>
#> Note: Model fit pooled across implicates via Rubin's Rules.
#> Inspect individual models via `object$models[[i]]`.
scf_regtable() formats results from one or more models
for publication:
m1 <- scf_ols(scf2022, log_networth ~ age)
m2 <- scf_ols(scf2022, log_networth ~ age + log_income)
scf_regtable(m1, m2, model.names = c("Model 1", "Model 2"), digits = 3)
#> (Intercept) 6.551*** (0.918) -16.345*** (2.720)
#> age 0.082*** (0.015) 0.088*** (0.013)
#> log_income -- 2.031*** (0.224)
#> N 200 200
#> R2 0.133 0.404
#> AIC 366 342
OLS models the conditional mean, which is sensitive to outliers and
skew. Wealth and income distributions in the SCF are highly
right-skewed: a small number of high-wealth households can dominate mean
estimates. Quantile regression estimates the conditional quantile of the
outcome at a user-specified probability tau, giving a more
complete picture of distributional associations.
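The contrast between the mean and a quantile can be illustrated with the check (pinball) loss that quantile regression minimizes. The base-R sketch below uses made-up values and a constant-only model; check_loss(), total_loss(), and best_fit() are hypothetical helpers written for this illustration and are not part of scf or quantreg.

```r
# Check (pinball) loss for residual u at quantile tau.
check_loss <- function(u, tau) u * (tau - (u < 0))

# Toy right-skewed outcome (made-up values for illustration only).
y <- c(1, 2, 3, 4, 100)

# Total loss when the constant c0 is used as the fitted value.
total_loss <- function(c0, tau) sum(check_loss(y - c0, tau))

# Search over the observed values; the minimizer is a sample quantile.
best_fit <- function(tau) {
  losses <- vapply(y, total_loss, numeric(1), tau = tau)
  y[which.min(losses)]
}

median_fit <- best_fit(0.50)
mean_fit <- mean(y)
print(c(median_fit = median_fit, mean_fit = mean_fit))
```

With tau = 0.50 the minimizer is the sample median, which the single extreme value does not move, while the mean is pulled far toward it.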
scf_quantreg() fits quantreg::rq() with SCF
final sampling weights to each implicate, then pools coefficients and
variance-covariance matrices across implicates via
scf_MIcombine().
# Median regression
m_med <- scf_quantreg(scf2022, log_networth ~ age + senior, tau = 0.50)
print(m_med)
#> Quantile Regression Results (tau = 0.50, Multiply-Imputed SCF)
#> ------------------------------------------------------------------
#> term estimate std.error t.value p.value stars
#> (Intercept) 8.2337 0.9775 8.4232 < 2.2e-16 ***
#> age 0.0772 0.0213 3.6295 0.0002844 ***
#> seniorTRUE -1.6873 0.8235 -2.0490 0.0404912 *
#>
#> SE method: nid | Implicates pooled: 5
#> R1(tau): 0.0661 R1 adj: 0.0518
#> Implicate-level rq objects stored in `object$models`.
# 75th percentile
m_75 <- scf_quantreg(scf2022, log_networth ~ age + senior, tau = 0.75)
summary(m_75)
#> SCF Quantile Regression Summary (tau = 0.75)
#> ------------------------------------------------------------------
#> Pooled Coefficient Estimates:
#> term estimate std.error t.value p.value stars
#> (Intercept) 9.4118 0.9856 9.5494 < 2.2e-16 ***
#> age 0.0761 0.0185 4.1243 3.856e-05 ***
#> seniorTRUE -1.7278 0.6220 -2.7776 0.005484 **
#>
#> Quantile: 0.75
#> SE method: nid
#> Implicates used: 5
#>
#> Goodness of Fit (Koenker-Machado, 1999):
#> Rho (full model): 252.7810
#> Rho (null model): 276.8924
#> R1(tau): 0.0871
#> R1(tau) adjusted: 0.0732
#> Mean N (implicates): 200
#> Note: R1 is a local fit measure at tau; it is not a global
#> summary of fit across the conditional distribution.
#> R1 adjusted uses a df-penalty not derived from asymptotic
#> theory; interpret it descriptively.
#>
#> Call:
#> scf_quantreg(object = scf2022, formula = log_networth ~ age +
#> senior, tau = 0.75)
The se argument controls within-implicate variance
estimation. All methods feed a covariance matrix into
scf_MIcombine(), so between-implicate (imputation) variance
is always incorporated.
"nid" (default): non-iid sandwich estimator. Allows the conditional density of the error distribution at the quantile (the reciprocal of the sparsity) to vary across observations. Appropriate when the error distribution is not constant across the covariate space, which is typical for wealth and income outcomes.
"iid": implements the Koenker-Bassett (1978) covariance formula [θ(1-θ)/f(ξ(θ))²] Q⁻¹, where θ is the quantile (tau), f(ξ(θ)) is the density of the error at the θ-quantile (the reciprocal of the sparsity), and Q = lim T⁻¹X'X. This assumes the error distribution is identical across all observations: constant sparsity, not merely constant variance. Fastest option.
"ker": kernel density estimate of the conditional sparsity. More data-adaptive than "nid".
"boot": pairs bootstrap. Distribution-free, but slower than the analytical options above.
"replicate": re-fits the model using each of the SCF’s 999 replicate weight vectors per implicate via survey::withReplicates(), accumulating variance as the weighted sum of squared deviations from the full-weight estimate. This matches the SCF’s own published variance methodology and is recommended for final publication-quality estimates. It requires approximately 5,000 model fits across five implicates and is computationally intensive.
# Replication-based variance (recommended for publication; slow)
m_rep <- scf_quantreg(scf2022, log_networth ~ age + senior,
                      tau = 0.50, se = "replicate")
summary(m_rep)
A note on design: the survey package provides
svyquantile() for estimating marginal quantiles of a single
variable, but it has no function for regression quantiles (conditional
quantiles given covariates). scf_quantreg() therefore uses
quantreg::rq() for point estimation, with
survey::withReplicates() providing the design-based
variance wrapper for the "replicate" path. This keeps the
SCF’s replication scheme encapsulated in the survey design object,
rather than reimplementing it manually.
Koenker and Bassett (1978) established the asymptotic theory for the
unweighted, i.i.d. case. Applying rq() with survey sampling
weights extends the estimator to probability-weighted samples, which is
standard empirical practice; the "iid" and
"nid" standard error formulas extend correspondingly to the
weighted design matrix X'WX. The "replicate"
option sidesteps this by using the SCF’s own replication scheme to
estimate variance directly, without relying on distributional
assumptions about the errors.
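The replicate-weight idea can be sketched in base R for a simple weighted mean. Everything below is illustrative: the data, the three replicate weight vectors, and the helper names are made up, and the default scaling constant 1 / (R - 1) is an assumption rather than the SCF's documented factor. In the package, this accumulation is delegated to the survey design object.

```r
# Toy data and weights (made up for illustration only).
y <- c(10, 20, 30, 40)
w0 <- c(1, 1, 1, 1)                       # full-sample weights
rep_w <- cbind(c(2, 0, 1, 1),             # three hypothetical replicate
               c(1, 2, 0, 1),             # weight vectors, one per column
               c(0, 1, 2, 1))

# Statistic of interest: a weighted mean.
weighted_mean <- function(w) sum(w * y) / sum(w)

theta0 <- weighted_mean(w0)               # full-weight estimate
theta_r <- apply(rep_w, 2, weighted_mean) # one estimate per replicate

# Scaled sum of squared deviations from the full-weight estimate.
# The default scale is an assumption made for this sketch.
replicate_variance <- function(theta_r, theta0,
                               scale = 1 / (length(theta_r) - 1)) {
  scale * sum((theta_r - theta0)^2)
}

v_rep <- replicate_variance(theta_r, theta0)
print(c(estimate = theta0, se = sqrt(v_rep)))
```

No error-distribution assumption enters this calculation; the spread of the replicate estimates alone determines the variance.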
Results are compatible with scf_regtable():
scf_regtable(m_med, m_75,
model.names = c("Median", "75th Pct"),
digits = 3)
#> (Intercept) 8.234*** (0.978) 9.412*** (0.986)
#> age 0.077*** (0.021) 0.076*** (0.018)
#> seniorTRUE -1.687* (0.823) -1.728** (0.622)
#> N 200 200
#> Tau 0.50 0.75
#> R1 0.066 0.087
#> R1(adj) 0.052 0.073
To study how associations vary across the outcome distribution, estimate the same model at multiple quantiles and compare:
scf plotting functions account for survey weights and
multiply-imputed data.
Use scf_implicates() to retrieve implicate-level
estimates from any scf_* result for sensitivity analysis or
custom pooling.
freq_table <- scf_freq(scf2022, ~rich)
scf_implicates(freq_table, long = TRUE)
#> implicate group category est var estimate se
#> richFALSE 1 NA FALSE 0.8730810 0.0004432830 0.8730810 0.02105429
#> richTRUE 1 NA TRUE 0.1269190 0.0004432830 0.1269190 0.02105429
#> richFALSE1 2 NA FALSE 0.8531922 0.0005488627 0.8531922 0.02342782
#> richTRUE1 2 NA TRUE 0.1468078 0.0005488627 0.1468078 0.02342782
#> richFALSE2 3 NA FALSE 0.8725839 0.0004395554 0.8725839 0.02096558
#> richTRUE2 3 NA TRUE 0.1274161 0.0004395554 0.1274161 0.02096558
#> richFALSE3 4 NA FALSE 0.8794327 0.0003846508 0.8794327 0.01961252
#> richTRUE3 4 NA TRUE 0.1205673 0.0003846508 0.1205673 0.01961252
#> richFALSE4 5 NA FALSE 0.8827906 0.0004057971 0.8827906 0.02014441
#> richTRUE4 5 NA TRUE 0.1172094 0.0004057971 0.1172094 0.02014441
#> lower upper cv
#> richFALSE 0.83181461 0.9143474 0.02411493
#> richTRUE 0.08565258 0.1681854 0.16588761
#> richFALSE1 0.80727369 0.8991107 0.02745902
#> richTRUE1 0.10088926 0.1927263 0.15958158
#> richFALSE2 0.83149137 0.9136764 0.02402700
#> richTRUE2 0.08632357 0.1685086 0.16454418
#> richFALSE3 0.84099216 0.9178732 0.02230133
#> richTRUE3 0.08212678 0.1590078 0.16266860
#> richFALSE4 0.84330761 0.9222737 0.02281901
#> richTRUE4 0.07772632 0.1566924 0.17186689
Implicate-level regression model objects are stored directly in the result: