Analyzing the Survey of Consumer Finances

Introduction

The Survey of Consumer Finances (SCF) is a triennial survey of U.S. household finances conducted by the Federal Reserve Board. It is among the most detailed data sources on U.S. household wealth, income, and financial behavior.

Valid estimation from the SCF requires handling two design features:

  1. Complex survey design: The SCF uses a dual-frame design with 999 bootstrap replicate weights per implicate, which enable design-consistent variance estimation.
  2. Multiple imputation: The SCF addresses item nonresponse through multiple imputation. Each release includes five implicates — complete, plausible versions of the dataset with different imputed values for missing items.

The scf package handles both. It wraps the five implicates in replicate-weighted survey::svrepdesign() objects, applies estimation routines across implicates, and pools results using Rubin’s Rules or equivalent procedures.
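
To make the replicate-weight idea concrete, here is a minimal base-R sketch of replicate-based variance for a weighted mean. The data, the four replicate weight columns, and the scaling convention are all made up for illustration; the survey package applies its own scaling for the declared replicate type, and the real SCF supplies 999 replicates.

```r
# Toy data: 6 households, one main weight, R = 4 hypothetical replicate weights.
set.seed(1)
n <- 6
R <- 4
y <- c(10, 20, 30, 40, 50, 600)   # outcome (made-up values)
w <- c(2, 1, 3, 1, 2, 1)          # main sampling weight (made-up values)

# Hypothetical bootstrap-style replicate weights: the main weight multiplied by
# the number of times each household is drawn in a resample of size n.
rep_w <- sapply(seq_len(R), function(r) {
  w * tabulate(sample(n, n, replace = TRUE), nbins = n)
})

wmean <- function(y, w) sum(w * y) / sum(w)

theta_hat <- wmean(y, w)                                  # full-sample estimate
theta_rep <- apply(rep_w, 2, function(wr) wmean(y, wr))   # one estimate per replicate

# Assumed convention: mean squared deviation of the replicate estimates
# around the full-sample estimate.
var_rep <- mean((theta_rep - theta_hat)^2)
se_rep  <- sqrt(var_rep)
```

The point is only that the estimator is recomputed once per replicate weight and the spread of those estimates stands in for the sampling variance.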

This vignette demonstrates the core workflow. For methodological background, see the package paper.

Workflow

1. Downloading and Loading the Data

Download SCF data and load it into a multiply-imputed survey object using scf_download() and scf_load().

scf_download(2022)
scf2022 <- scf_load(2022)

The result is an scf_mi_survey object containing five svyrep.design objects, one per implicate, and survey-year metadata.

2. Creating and Transforming Variables

scf_update() adds or modifies variables uniformly across all five implicates. Bottom-code skewed variables before logging, because log() returns -Inf at zero and NaN for negative values.

scf2022 <- scf_update(scf2022,
  senior       = age >= 65,
  female       = factor(hhsex, levels = 1:2, labels = c("Male", "Female")),
  rich         = networth > 1e6,
  networth     = ifelse(networth > 1, networth, 1),
  log_networth = log(networth),
  income       = ifelse(income > 1, income, 1),
  log_income   = log(income),
  npeople      = x101
)

Use names(scf2022$mi_design[[1]]$variables) to inspect available variables.
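
The effect of the bottom-coding used in the scf_update() call can be checked on a toy vector. This sketch uses made-up values and plain base R, not the package:

```r
# Made-up net worth values, including negative, zero, and sub-dollar cases
x <- c(-5000, 0, 0.5, 1, 250000)

# Same rule as in the scf_update() call: anything at or below 1 becomes 1
x_bc <- ifelse(x > 1, x, 1)

log_raw <- suppressWarnings(log(x))  # NaN for negatives, -Inf for zero
log_bc  <- log(x_bc)                 # finite everywhere; floor values map to 0
```

Note that every value at or below 1 collapses to log(1) = 0, so the logged variable cannot distinguish debtors from households with zero net worth.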

When a transformation depends on the distribution within each implicate — ranks, percentile flags, groupwise z-scores — use scf_update_by_implicate() instead:

scf2022 <- scf_update_by_implicate(scf2022, function(df) {
  df$wealth_rank <- rank(df$networth) / nrow(df)
  df
})
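
Why the distinction matters can be seen with two toy "implicates" in base R. The data frames and the add_rank() helper are hypothetical and stand in for what the package does internally:

```r
# Two made-up implicates: same households, but household 3 has a
# different imputed net worth in each.
imp1 <- data.frame(id = 1:4, networth = c(100, 200, 300, 400))
imp2 <- data.frame(id = 1:4, networth = c(100, 200, 450, 400))
implicates <- list(imp1, imp2)

# Hypothetical helper: rank within one implicate only
add_rank <- function(df) {
  df$wealth_rank <- rank(df$networth) / nrow(df)
  df
}

implicates <- lapply(implicates, add_rank)

# Household 3's rank differs across implicates (0.75 in the first, 1 in the second)
rank_hh3 <- sapply(implicates, function(df) df$wealth_rank[df$id == 3])
```

A rank computed on the stacked data would mix imputed values from different implicates, which is why the transformation is applied one implicate at a time.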

3. Univariate and Bivariate Distributions

scf_mean(), scf_median(), and scf_percentile() return pooled population estimates with standard errors. The by argument produces group-level estimates.

scf_mean(scf2022, ~networth, by = ~senior)
#> Multiply-Imputed, Replicate-Weighted Mean Estimate
#> 
#>  group variable  estimate       se     min       max
#>  FALSE networth  809777.2 135477.9  771184  821626.9
#>   TRUE networth 1595858.8 483053.9 1355046 1812726.0
scf_median(scf2022, ~income, by = ~female)
#> Multiply-Imputed Median Estimate
#> 
#>   group variable quantile estimate       se      min      max
#>    Male   income      0.5 85824.40 6488.425 85392.03 86472.95
#>  Female   income      0.5 49721.94 5304.311 49721.94 49721.94
scf_percentile(scf2022, ~networth, q = 0.9)
#> SCF Percentile Estimate
#> 
#> SCF Percentile Estimate (SCF Bulletin convention)
#> 
#>  variable quantile estimate       se     min     max
#>  networth      0.9  1197722 393453.4 1039600 1360000
scf_percentile(scf2022, ~networth, q = 0.75, by = ~female)
#> SCF Percentile Estimate
#> 
#> SCF Percentile Estimate (SCF Bulletin convention)
#> 
#>   group variable quantile estimate        se    min    max
#>    Male networth     0.75   642660 192745.73 623800 698700
#>  Female networth     0.75   238504  56917.53 237680 241800
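
For intuition about what a weighted quantile is within a single implicate, here is a minimal base-R sketch. The weighted_quantile() helper and the data are hypothetical; scf_median() and scf_percentile() may interpolate differently or follow the SCF Bulletin convention named in the output.

```r
# Hypothetical helper: smallest x whose cumulative weight share reaches q
weighted_quantile <- function(x, w, q) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= q)[1]]
}

# Made-up data: the largest value carries most of the population weight
x <- c(10, 20, 30, 40, 1000)
w <- c(1, 1, 1, 1, 6)

wq_median  <- weighted_quantile(x, w, 0.5)  # 1000: weights shift the median
raw_median <- median(x)                     # 30: unweighted sample median
```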

Discrete distributions are summarized with scf_freq() and cross-tabulated with scf_xtab().

scf_freq(scf2022, ~senior)
#> SCF Frequency Table (Pooled Results)
#> 
#>  group category proportion se_proportion
#>     NA    FALSE   75.77111      3.135819
#>     NA     TRUE   24.22889      3.135819
scf_xtab(scf2022, ~senior, ~female, scale = "col")
#> SCF Cross-Tabulation
#> Row Variable: senior | Column Variable: female 
#> Displayed as: col proportions (percent)
#> 
#>               female: Female female: Male
#> senior: FALSE          78.51        70.60
#> senior: TRUE           21.49        29.40
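
The column-percent scaling can be illustrated with base R on a single toy data frame. This is a sketch with made-up weights and says nothing about how scf_xtab() pools across implicates:

```r
# Made-up households with a sampling weight
d <- data.frame(
  senior = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  female = c("Male", "Female", "Male", "Female", "Male", "Female"),
  w      = c(3, 2, 1, 2, 4, 1)
)

# Weighted counts: each cell is the sum of weights, not the number of rows
tab <- xtabs(w ~ senior + female, data = d)

# Column percentages: each column sums to 100
col_pct <- prop.table(tab, margin = 2) * 100
```

With these weights, 87.5 percent of the weighted "Male" column falls in the non-senior row, and 40 percent of the "Female" column does.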

4. Hypothesis Tests

scf_ttest() and scf_prop_test() produce pooled inferential tests with correct degrees of freedom.

scf_ttest(scf2022, ~networth, mu = 250000)
#> SCF One-sample t-test
#> Alternative hypothesis: mean is not equal to 250000 
#> 
#> Estimate: 1000482.40 
#> Standard Error: 153600.47 
#> t = 4.89, df = 744.0, p = 0.0000 *** 
#> CI (95%): [698940.49, 1302024.31]
scf_ttest(scf2022, ~networth, group = ~senior)
#> SCF Two-sample t-test
#> Alternative hypothesis: mean is not equal to 0 
#> 
#> Group means:
#>  group      mean
#>  FALSE  809777.2
#>   TRUE 1595858.8
#> 
#> Estimate: -786081.60 
#> Standard Error: 507708.12 
#> t = -1.55, df = 114.2, p = 0.1243  
#> CI (95%): [-1791827.18, 219663.98]
scf_prop_test(scf2022, ~senior, p = 0.25)
#> 
#> One-sample proportion test
#> Null hypothesis: proportion = 0.25
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#> 
#>  estimate std.error z.value p.value conf.low conf.high stars
#>    0.2423    0.0022 -3.5822   3e-04   0.2381    0.2465   ***
scf_prop_test(scf2022, ~rich, ~female)
#> 
#> Two-sample proportion test
#> Null hypothesis: proportion difference = 0.5
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#> 
#>  estimate std.error  z.value p.value conf.low conf.high stars
#>    0.1626    0.0253 -13.3593       0   0.1131    0.2121   ***
#> 
#> Estimated group proportions:
#>   group proportion
#>    Male     0.1842
#>  Female     0.0216
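
The one-sample t-test output can be reproduced by hand from its pooled pieces. This sketch takes the estimate, standard error, and degrees of freedom printed above as given; it does not show how the package derives them.

```r
# Values copied from the one-sample scf_ttest() output
est <- 1000482.40
se  <- 153600.47
mu0 <- 250000
df  <- 744

t_stat <- (est - mu0) / se                    # test statistic
p_val  <- 2 * pt(-abs(t_stat), df)            # two-sided p-value
ci     <- est + c(-1, 1) * qt(0.975, df) * se # 95% confidence interval
```

The resulting statistic and interval agree with the printed t = 4.89 and CI up to rounding.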

5. Regression Modeling

scf_ols(), scf_glm(), and scf_logit() fit replicate-weighted models to each implicate and pool coefficients via Rubin’s Rules.

scf_ols(scf2022, log_networth ~ age + log_income)
#> OLS Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#>         term  estimate std.error t.value   p.value stars
#>  (Intercept) -16.34472   2.71954  -6.010 3.185e-09   ***
#>          age   0.08801   0.01307   6.732 2.218e-11   ***
#>   log_income   2.03060   0.22425   9.055 6.056e-18   ***
#> 
#> Model Fit Statistics:
#>   Mean R-squared: 0.4035 (SD: 0.011)
#>   Mean AIC:       342.0288 (SD: 498.395)
#> 
#> Note: Implicate-level model objects are stored in `object$imps`
#>       Use `summary(object$imps[[1]])` to inspect them.
scf_logit(scf2022, rich ~ age + log_income, odds = TRUE)
#> Logistic Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#>         term estimate std.error t.value p.value stars
#>  (Intercept)   0.0000    0.0000 -4.0575  <2e-16   ***
#>          age   1.0964    0.0409  2.4634  0.0138     *
#>   log_income  14.1789    9.3306  4.0297  0.0001   ***
#> 
#> Model Fit Diagnostics:
#>   Pseudo R-squared:  0.5069 
#>   Mean AIC:          NA 
#> 
#> Notes:
#>  - Estimates are reported on the Odds Ratio scale.
#>  - Implicate-level models are stored in `object$imps`
scf_glm(scf2022, own ~ age, family = binomial())
#> Generalized Linear Model (Multiply-Imputed SCF)
#> --------------------------------------------------
#>         term estimate std.error z.value p.value stars
#>  (Intercept)   1.4508    0.6710  2.1621 0.03061     *
#>          age   0.0070    0.0127  0.5494 0.58270      
#> 
#> Model Fit Diagnostics:
#>   Pseudo R-squared: 0.002 (SD: 0.000)
#>   Mean AIC:         NA (SD: NA)
#> 
#> Note: Model fit pooled across implicates via Rubin's Rules.
#>       Inspect individual models via `object$models[[i]]`.
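
The odds-ratio scale reported with odds = TRUE is the exponential of the log-odds coefficients. The sketch below shows the conversion on simulated data with an ordinary unweighted glm(); the data and model are made up and involve neither survey weights nor implicates.

```r
# Simulated data (illustration only)
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

fit <- glm(y ~ x, family = binomial())

log_odds    <- coef(fit)      # coefficients on the log-odds scale
odds_ratios <- exp(log_odds)  # multiplicative change in odds per unit of x
```

Read this way, the age odds ratio of 1.0964 in the scf_logit() output corresponds to a log-odds coefficient of about 0.092: each additional year multiplies the odds of being rich by roughly 1.10, holding log income fixed.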

scf_regtable() formats results from one or more models for publication:

m1 <- scf_ols(scf2022, log_networth ~ age)
m2 <- scf_ols(scf2022, log_networth ~ age + log_income)
scf_regtable(m1, m2, model.names = c("Model 1", "Model 2"), digits = 3)
#> (Intercept)   6.551*** (0.918)   -16.345*** (2.720)
#> age           0.082*** (0.015)   0.088*** (0.013)
#> log_income    --   2.031*** (0.224)
#> N             200   200
#> R2            0.133   0.404
#> AIC           366   342

6. Quantile Regression

OLS models the conditional mean, which is sensitive to outliers and skew. Wealth and income distributions in the SCF are highly right-skewed: a small number of high-wealth households can dominate mean estimates. Quantile regression estimates the conditional quantile of the outcome at a user-specified probability tau, giving a more complete picture of distributional associations.
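
The contrast between the mean and a quantile can be seen in the loss each one minimizes. The sketch below uses the check (pinball) loss from the quantile regression literature on a made-up vector with one extreme value; it is an intercept-only illustration in base R, not the package's estimator.

```r
# Check loss: tau * u for positive residuals, (tau - 1) * u for negative ones
check_loss <- function(u, tau) u * (tau - (u < 0))

# Made-up outcome with one extreme value
y <- c(1, 2, 3, 4, 100)

# Total loss if every observation is predicted by the constant q
total_loss <- function(q, tau) sum(check_loss(y - q, tau))

# Evaluate the loss at each observed value and keep the minimizer
q50 <- y[which.min(sapply(y, total_loss, tau = 0.5))]  # 3
q90 <- y[which.min(sapply(y, total_loss, tau = 0.9))]  # 100

y_mean <- mean(y)                                      # 22, pulled up by the outlier
```

Quantile regression replaces the constant q with a linear predictor and minimizes the same loss, so changing tau changes which part of the conditional distribution the coefficients describe.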

scf_quantreg() fits quantreg::rq() with SCF final sampling weights to each implicate, then pools coefficients and variance-covariance matrices across implicates via scf_MIcombine().

# Median regression
m_med <- scf_quantreg(scf2022, log_networth ~ age + senior, tau = 0.50)
print(m_med)
#> Quantile Regression Results (tau = 0.50, Multiply-Imputed SCF)
#> ------------------------------------------------------------------
#>         term estimate std.error t.value   p.value stars
#>  (Intercept)   8.2337    0.9775  8.4232 < 2.2e-16   ***
#>          age   0.0772    0.0213  3.6295 0.0002844   ***
#>   seniorTRUE  -1.6873    0.8235 -2.0490 0.0404912     *
#> 
#> SE method: nid | Implicates pooled: 5
#> R1(tau):   0.0661    R1 adj:  0.0518
#> Implicate-level rq objects stored in `object$models`.
# 75th percentile
m_75 <- scf_quantreg(scf2022, log_networth ~ age + senior, tau = 0.75)
summary(m_75)
#> SCF Quantile Regression Summary (tau = 0.75)
#> ------------------------------------------------------------------
#> Pooled Coefficient Estimates:
#>         term estimate std.error t.value   p.value stars
#>  (Intercept)   9.4118    0.9856  9.5494 < 2.2e-16   ***
#>          age   0.0761    0.0185  4.1243 3.856e-05   ***
#>   seniorTRUE  -1.7278    0.6220 -2.7776  0.005484    **
#> 
#> Quantile:        0.75
#> SE method:       nid
#> Implicates used: 5
#> 
#> Goodness of Fit (Koenker-Machado, 1999):
#>   Rho (full model):  252.7810
#>   Rho (null model):  276.8924
#>   R1(tau):           0.0871
#>   R1(tau) adjusted:  0.0732
#>   Mean N (implicates): 200
#>   Note: R1 is a local fit measure at tau; it is not a global
#>   summary of fit across the conditional distribution.
#>   R1 adjusted uses a df-penalty not derived from asymptotic
#>   theory; interpret it descriptively.
#> 
#> Call:
#> scf_quantreg(object = scf2022, formula = log_networth ~ age + 
#>     senior, tau = 0.75)

The se argument controls within-implicate variance estimation. All methods feed a covariance matrix into scf_MIcombine(), so between-implicate (imputation) variance is always incorporated.

# Replication-based variance (recommended for publication; slow)
m_rep <- scf_quantreg(scf2022, log_networth ~ age + senior,
                       tau = 0.50, se = "replicate")
summary(m_rep)

A note on design: the survey package provides svyquantile() for estimating marginal quantiles of a single variable, but it has no function for regression quantiles (conditional quantiles given covariates). scf_quantreg() therefore uses quantreg::rq() for point estimation, with survey::withReplicates() providing the design-based variance wrapper for the "replicate" path. This keeps the SCF’s replication scheme encapsulated in the survey design object, rather than reimplementing it manually.

Koenker and Bassett (1978) established the asymptotic theory for the unweighted, i.i.d. case. Applying rq() with survey sampling weights extends the estimator to probability-weighted samples, which is standard empirical practice; the "iid" and "nid" standard error formulas extend correspondingly to the weighted design matrix X'WX. The "replicate" option sidesteps these analytic formulas by using the SCF’s own replication scheme to estimate variance directly, without relying on distributional assumptions about the errors.

Results are compatible with scf_regtable():

scf_regtable(m_med, m_75,
             model.names = c("Median", "75th Pct"),
             digits = 3)
#> (Intercept)   8.234*** (0.978)   9.412*** (0.986)
#> age           0.077*** (0.021)   0.076*** (0.018)
#> seniorTRUE    -1.687* (0.823)   -1.728** (0.622)
#> N             200   200
#> Tau           0.50   0.75
#> R1            0.066   0.087
#> R1(adj)       0.052   0.073

To study how associations vary across the outcome distribution, estimate the same model at multiple quantiles and compare:

taus <- c(0.25, 0.50, 0.75, 0.90)
models <- lapply(taus, function(t) {
  scf_quantreg(scf2022, log_networth ~ age + senior, tau = t)
})
names(models) <- paste0("tau=", taus)
do.call(scf_regtable, c(models, list(digits = 3)))

7. Visualization

The package’s plotting functions account for survey weights and multiply-imputed data.

scf_plot_dbar(scf2022, ~senior)

scf_plot_bbar(scf2022, ~female, ~rich, scale = "percent")

scf_plot_cbar(scf2022, ~networth, ~edcl, stat = "median")

scf_plot_dist(scf2022, ~age, bins = 10)

scf_plot_smooth(scf2022, ~age)

scf_plot_hex(scf2022, ~income, ~networth)
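
What "accounting for survey weights" means for a binned plot can be sketched in base R: bar heights are shares of total weight rather than shares of rows. The data and bin edges are made up, and how the plotting functions combine implicates is not shown here.

```r
# Made-up ages and sampling weights
age <- c(25, 34, 41, 47, 58, 63, 72, 80)
w   <- c(3, 2, 2, 1, 1, 1, 1, 1)

# Three bins: [20,40], (40,60], (60,80]
bins <- cut(age, breaks = c(20, 40, 60, 80), include.lowest = TRUE)

# Weighted share of the population in each bin
weighted_share   <- tapply(w, bins, sum) / sum(w)

# Unweighted share of sample rows, for comparison
unweighted_share <- as.numeric(table(bins)) / length(age)
```

Here the youngest bin holds a quarter of the sample rows but five twelfths of the weighted population.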

8. Inspecting Implicates

Use scf_implicates() to retrieve implicate-level estimates from any scf_* result for sensitivity analysis or custom pooling.

freq_table <- scf_freq(scf2022, ~rich)
scf_implicates(freq_table, long = TRUE)
#>            implicate group category       est          var  estimate         se
#> richFALSE          1    NA    FALSE 0.8730810 0.0004432830 0.8730810 0.02105429
#> richTRUE           1    NA     TRUE 0.1269190 0.0004432830 0.1269190 0.02105429
#> richFALSE1         2    NA    FALSE 0.8531922 0.0005488627 0.8531922 0.02342782
#> richTRUE1          2    NA     TRUE 0.1468078 0.0005488627 0.1468078 0.02342782
#> richFALSE2         3    NA    FALSE 0.8725839 0.0004395554 0.8725839 0.02096558
#> richTRUE2          3    NA     TRUE 0.1274161 0.0004395554 0.1274161 0.02096558
#> richFALSE3         4    NA    FALSE 0.8794327 0.0003846508 0.8794327 0.01961252
#> richTRUE3          4    NA     TRUE 0.1205673 0.0003846508 0.1205673 0.01961252
#> richFALSE4         5    NA    FALSE 0.8827906 0.0004057971 0.8827906 0.02014441
#> richTRUE4          5    NA     TRUE 0.1172094 0.0004057971 0.1172094 0.02014441
#>                 lower     upper         cv
#> richFALSE  0.83181461 0.9143474 0.02411493
#> richTRUE   0.08565258 0.1681854 0.16588761
#> richFALSE1 0.80727369 0.8991107 0.02745902
#> richTRUE1  0.10088926 0.1927263 0.15958158
#> richFALSE2 0.83149137 0.9136764 0.02402700
#> richTRUE2  0.08632357 0.1685086 0.16454418
#> richFALSE3 0.84099216 0.9178732 0.02230133
#> richTRUE3  0.08212678 0.1590078 0.16266860
#> richFALSE4 0.84330761 0.9222737 0.02281901
#> richTRUE4  0.07772632 0.1566924 0.17186689
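
As an example of custom pooling, the textbook form of Rubin's Rules can be applied by hand to the richTRUE rows above, using the est and var columns. This is a sketch of the standard formulas; the package may apply a different convention, so the result need not match the pooled scf_freq() output exactly.

```r
# Implicate-level estimates and variances for richTRUE, copied from the table
est <- c(0.1269190, 0.1468078, 0.1274161, 0.1205673, 0.1172094)
v   <- c(0.0004432830, 0.0005488627, 0.0004395554, 0.0003846508, 0.0004057971)
m   <- length(est)

q_bar <- mean(est)            # pooled point estimate
W     <- mean(v)              # within-implicate (sampling) variance
B     <- var(est)             # between-implicate (imputation) variance
T_var <- W + (1 + 1 / m) * B  # total variance
se    <- sqrt(T_var)

# Rubin's degrees of freedom for the pooled t reference distribution
r  <- (1 + 1 / m) * B / W
df <- (m - 1) * (1 + 1 / r)^2
```

The between-implicate term is what a single-implicate analysis omits, which is why standard errors from one implicate understate uncertainty.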

Implicate-level regression model objects are stored directly in the result:

m <- scf_ols(scf2022, log_networth ~ age)
summary(m$imps[[1]])   # first implicate svyglm object

Learn More