Type: | Package |
Title: | Artless Automatic Multivariate Matching for Observational Studies |
Version: | 0.3.7 |
Maintainer: | Paul Rosenbaum <rosenbaum@wharton.upenn.edu> |
Description: | Implements a simple version of multivariate matching using a propensity score, near-exact matching, near-fine balance, and robust Mahalanobis distance matching (Rosenbaum 2020 <doi:10.1146/annurev-statistics-031219-041058>). You specify the variables, and the program does everything else. |
License: | GPL-2 |
Encoding: | UTF-8 |
Imports: | iTOS, stats |
Suggests: | DOS2, sensitivity2x2xk, sensitivitymv, weightedRank, xtable |
Depends: | R (≥ 3.5.0) |
NeedsCompilation: | no |
Packaged: | 2025-06-21 15:13:31 UTC; rosenbap |
Author: | Paul Rosenbaum [aut, cre] |
Repository: | CRAN |
Date/Publication: | 2025-06-24 09:40:02 UTC |
Artless Automatic Multivariate Matching for Observational Studies
Description
Implements a simple version of multivariate matching using a propensity score, near-exact matching, near-fine balance, and robust Mahalanobis distance matching (Rosenbaum 2020 <doi:10.1146/annurev-statistics-031219-041058>). You specify the variables, and the program does everything else.
Details
Package aamatch implements a simple version of multivariate matching in observational studies, using propensity scores, minimum distance matching, near-exact matching and fine balance. The only function in the package is artless().
Author(s)
Paul Rosenbaum [aut, cre]
Maintainer: Paul Rosenbaum <rosenbaum@wharton.upenn.edu>
References
Rosenbaum, P. R. (2020a) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies (2nd Edition). New York: Springer.
Rosenbaum, P. R. (2020b). <doi:10.1146/annurev-statistics-031219-041058> Modern algorithms for matching in observational studies. Annual Review of Statistics and Its Application, 7(1), 143-176.
Rosenbaum, P. R. (2025) Introduction to the Theory of Observational Studies. New York: Springer.
Zhang, B., D. S. Small, K. B. Lasater, M. McHugh, J. H. Silber, and P. R. Rosenbaum (2023) <doi:10.1080/01621459.2021.1981337> Matching one sample according to two criteria in observational studies. Journal of the American Statistical Association, 118, 1140-1151.
Matched Periodontal Disease Data
Description
Matched data from NHANES 2009-2010, 2011-2012, 2013-2014 concerning smoking and periodontal disease. The matched data were built from the unmatched data in PeriUnmatched in this package.
Usage
data("PeriMatched")
Format
A data frame with 3489 observations on the following 18 variables.
SEQN
NHANES ID number
female
1=female, 0=male
age
Age in years, capped at 80 for confidentiality
ageFloor
Age decade = floor(age/10)
educ
Education as 1 to 5. 1 is less than 9th grade, 2 is at least 9th grade with no high school degree, 3 is a high school degree, 4 is some college, such as a 2-year associate's degree, 5 is at least a 4-year college degree.
noHS
No high school degree. 1 if educ is 1 or 2, 0 if educ is 3 or more
income
Ratio of family income to the poverty level, capped at 5 for confidentiality
nh
The specific NHANES survey. A factor with levels nh0910 < nh1112 < nh1314
cigsperday
Number of cigarettes smoked per day. 0 for nonsmokers.
z
Daily smoker. 1 indicates someone who smokes every day. 0 indicates a never-smoker who smoked fewer than 100 cigarettes in their life.
pd
A percent indicating periodontal disease. See details.
prop
A propensity score created in the example for PeriUnmatched. This propensity score decided which smokers would have 1 control and which would have 4 controls.
pr
A second propensity score used to create matched pairs or matched 1-to-4 sets, after the split based on prop
mset
Indicator of the matched set, 1, 2, ..., 1425
treated
The SEQN for the smoker in this matched set. Contains the same information as mset, but in a different form.
pair
1 for a matched pair, 0 for a 1-to-4 matched set
grp2
An ordered factor with the same information as z: S=daily smoker, N=never smoker. Levels: S < N
grp3
A factor with the joint information in pair and grp2. Levels: 1-1:S, 1-1:N, 1-4:S, 1-4:N
Details
Measurements were made for up to 28 teeth, 14 upper, 14 lower, excluding 4 wisdom teeth. Pocket depth and loss of attachment are two complementary measures of the degree to which the gums have separated from the teeth; see Wei, Barker and Eke (2013). Pocket depth and loss of attachment are measured at six locations on each tooth, provided the tooth is present. A measurement at a location was taken to exhibit disease if it had either a loss of attachment >=4mm or a pocket depth >=4mm, so each tooth contributes six binary scores, up to 6x28=168 binary scores. The variable pd is the percent of these binary scores indicating periodontal disease, 0 to 100 percent.
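The definition of pd can be sketched on made-up measurements for one person. The matrices pocket and loa, the choice of missing teeth, and the use of measured sites as the denominator are assumptions made for this illustration; the numbers are simulated and are not NHANES data.

```r
# Simulated measurements for one person: 28 teeth (rows) by 6 sites (columns)
set.seed(1)
pocket <- matrix(sample(1:6, 28 * 6, replace = TRUE), 28, 6)  # pocket depth, mm
loa <- matrix(sample(0:5, 28 * 6, replace = TRUE), 28, 6)     # loss of attachment, mm

# Two teeth are absent, so their six sites are not measured
pocket[c(3, 17), ] <- NA
loa[c(3, 17), ] <- NA

# A site exhibits disease if either measure is at least 4 mm
diseased <- (pocket >= 4) | (loa >= 4)

# Percent of the measured sites that exhibit disease, between 0 and 100
pd_one <- 100 * mean(diseased, na.rm = TRUE)
pd_one
```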
The data from three NHANES surveys (specifically 2009-2010, 2011-2012, and 2013-2014) contain periodontal data and are used as an example in Rosenbaum (2025). The data from one survey, 2011-2012, were used in Rosenbaum (2016). The example replicates analyses from Rosenbaum (2025).
Note
All analyses below distinguish the 1-to-1 pairs and the 1-to-4 sets, even though the information they provide is often combined. Alternatively, one can combine analyses of pairs and 1-to-4 sets using methods that take account of the matched blocks of variable sizes. For instance, for continuous responses, one can use the methods in Rosenbaum (2007) as implemented in the R package sensitivitymv; see also Rosenbaum (2015). For binary responses, one can use the methods in Rosenbaum and Small (2017) as implemented in the R package sensitivity2x2xk.
In contrast, some care is required in plots and descriptive statistics. One can straightforwardly plot the pairs, then separately plot the 1-to-4 sets, and one can do the same with descriptive statistics. Suppose, however, that one merges the two treated groups from pairs and 1-to-4 sets, and merges the two control groups from pairs and 1-to-4 sets; then marginal distributions of outcomes from the pooled treated and control groups are no longer comparable. See Pimentel, Yoon and Keele (2015). For instance, in the example, there is exact matching for sex; however, most pairs are men and most 1-to-4 sets are women. Pool the pairs and the 1-to-4 sets and the pooled control group has proportionately more women than the pooled treated group. To see this, type:
data("PeriMatched")
tapply(PeriMatched$female,PeriMatched$grp3,mean)
tapply(PeriMatched$female,PeriMatched$grp2,mean)
The simple, often enlightening, solution is to plot pairs and 1-to-4 sets in parallel but separately, and to do the same with descriptive statistics.
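The same phenomenon can be reproduced with a small artificial matched sample, without loading the package data. The helper make_sets() and all counts below are hypothetical; they were chosen so that sex is exactly matched in every set, most pairs are men, and most 1-to-4 sets are women.

```r
# Hypothetical helper: n_sets matched sets, each with 1 treated and n_controls
# controls, where everyone in a set shares the set's value of female
make_sets <- function(n_sets, n_controls, female) {
  data.frame(z = rep(c(1, rep(0, n_controls)), n_sets),
             female = rep(female, each = n_controls + 1))
}
pairs <- make_sets(100, 1, female = rep(c(0, 1), c(80, 20)))
sets <- make_sets(50, 4, female = rep(c(0, 1), c(10, 40)))
pairs$type <- "1-1"
sets$type <- "1-4"
toy <- rbind(pairs, sets)

# Within each type, treated and control groups have the same proportion of women
within <- tapply(toy$female, list(toy$type, toy$z), mean)
within

# Pooled over types, the control group has proportionately more women
pooled <- tapply(toy$female, toy$z, mean)
pooled
```

In this toy sample the proportion of women is 0.2 in both groups among pairs and 0.8 in both groups among 1-to-4 sets, yet after pooling it is 0.4 among treated individuals and 0.6 among controls, because the 1-to-4 sets contribute four times as many controls as treated individuals.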
Source
US National Health and Nutrition Examination Survey (NHANES). https://www.cdc.gov/nchs/nhanes/
References
Pimentel, S. D., Yoon, F., & Keele, L. (2015) <doi:10.1002/sim.6593> Variable‐ratio matching with fine balance in a study of the Peer Health Exchange. Statistics in Medicine, 34(30), 4070-4082.
Rosenbaum, P. R. (2007) <doi:10.1111/j.1541-0420.2006.00717.x> Sensitivity analysis for m-estimates, tests, and confidence intervals in matched observational studies. Biometrics, 63(2), 456-464.
Rosenbaum, P. R. (2015) <doi:10.1353/obs.2015.0000> Two R packages for sensitivity analysis in observational studies. Observational Studies, 1(2), 1-17. Available on-line at: muse.jhu.edu/article/793399/summary
Rosenbaum, P. R. (2016) <doi:10.1214/16-AOAS942> Using Scheffe projections for multiple outcomes in an observational study of smoking and periodontal disease. Annals of Applied Statistics, 10, 1447-1471.
Rosenbaum, P. R., & Small, D. S. (2017) <doi:10.1111/biom.12591> An adaptive Mantel–Haenszel test for sensitivity analysis in observational studies. Biometrics, 73(2), 422-430.
Rosenbaum, Paul R. (2025) A Design for Observational Studies in Which Some People Avoid Treatment. Manuscript.
Tomar, S. L. and Asma, S. (2000). Smoking attributable periodontitis in the United States: Findings from NHANES III. J. Periodont. 71, 743-751.
Wei, L., Barker, L. and Eke, P. (2013). Array applications in determining periodontal disease measurement. SouthEast SAS User's Group. (SESUG2013) Paper CC-15, analytics.ncsu.edu/sesug/2013/CC-15.pdf.
Examples
data(PeriMatched)
# The analysis in Rosenbaum (2025) is replicated below
#
dm2<-PeriMatched
dm<-PeriMatched[PeriMatched$pair==1,]
dm1<-PeriMatched[PeriMatched$pair==0,]
pd1<-t(matrix(dm$pd,2,dim(dm)[1]/2))
pd4<-t(matrix(dm1$pd,5,dim(dm1)[1]/5))
dm2$mset<-as.integer(dm2$mset)
#
# Make Figure 1
#
old.par <- par(no.readonly = TRUE)
par(mfrow=c(1,3))
boxplot(dm2$prop~dm2$grp3,names=c(expression(S[1]),expression(N[1]),
expression(S[4]),expression(N[4])),
las=1,sub="Left is 1-1, Right is 1-4",cex.sub=.9,cex.axis=1,
ylab="Propensity Score",xlab="(i) Propensity Score")
#axis(3,at=1:4,lab=round(tapply(dm2$prop,dm2$grp3,mean),2),cex.axis=1)
axis(3,at=1:4,lab=c("0.36","0.34","0.10","0.10"),cex.axis=1) # don't round 0.1
boxplot(dm2$educ~dm2$grp3,names=c(expression(S[1]),expression(N[1]),
expression(S[4]),expression(N[4])),
las=1,sub="Left is 1-1, Right is 1-4",cex.sub=.9,cex.axis=1,
ylab="Education: 1 is <9th, 3 is HS, 5 is BA",xlab="(ii) Education")
#axis(3,at=1:4,lab=round(tapply(dm2$educ,dm2$grp3,mean),1),cex.axis=1)
axis(3,at=1:4,lab=c("3.0","3.1","4.0","4.0"),cex.axis=1)
boxplot(dm2$income~dm2$grp3,names=c(expression(S[1]),expression(N[1]),
expression(S[4]),expression(N[4])),
las=1,sub="Left is 1-1, Right is 1-4",cex.sub=.9,cex.axis=1,
ylab="Income / (Poverty Level)",xlab="(iii) Income")
axis(3,at=1:4,lab=round(tapply(dm2$income,dm2$grp3,mean),1),cex.axis=1)
#
# Make Figure 2
#
par(mfrow=c(1,2))
boxplot(dm2$cigsperday~dm2$grp3,names=c(expression(S[1]),expression(N[1]),
expression(S[4]),expression(N[4])),
las=1,sub="Left is 1-1, Right is 1-4",cex.sub=.9,cex.axis=1,
ylab="Cigarettes Per Day",xlab="(i) Cigarettes Per Day")
axis(3,at=1:4,lab=round(tapply(dm2$cigsperday,dm2$grp3,mean),0),cex.axis=1)
boxplot(dm2$pd~dm2$grp3,names=c(expression(S[1]),expression(N[1]),
expression(S[4]),expression(N[4])),
las=1,sub="Left is 1-1, Right is 1-4",cex.sub=.9,cex.axis=1,
ylab="Periodontal Disease",xlab="(ii) Periodontal Disease")
axis(3,at=1:4,lab=round(tapply(dm2$pd,dm2$grp3,mean),0),cex.axis=1)
#
# Make Table 1
#
tb<-NULL
N<-tapply(dm2$female,dm2$grp3,length)
tb<-cbind(tb,N)
rm(N)
Female<-tapply(dm2$female,dm2$grp3,mean)*100
tb<-cbind(tb,Female)
rm(Female)
Age<-tapply(dm2$age,dm2$grp3,mean)
tb<-cbind(tb,Age)
rm(Age)
Income<-tapply(dm2$income,dm2$grp3,mean)
tb<-cbind(tb,Income)
rm(Income)
Income10<-tapply(dm2$income,dm2$grp3,quantile,c(.1))
tb<-cbind(tb,Income10)
rm(Income10)
Income90<-tapply(dm2$income,dm2$grp3,quantile,c(.9))
tb<-cbind(tb,Income90)
rm(Income90)
Education25<-tapply(dm2$educ,dm2$grp3,quantile,c(.25))
tb<-cbind(tb,Education25)
rm(Education25)
Education50<-tapply(dm2$educ,dm2$grp3,quantile,c(.5))
tb<-cbind(tb,Education50)
rm(Education50)
Education75<-tapply(dm2$educ,dm2$grp3,quantile,c(.75))
tb<-cbind(tb,Education75)
rm(Education75)
PropensityMin<-tapply(dm2$prop,dm2$grp3,min)
tb<-cbind(tb,PropensityMin)
rm(PropensityMin)
Propensity<-tapply(dm2$prop,dm2$grp3,median)
tb<-cbind(tb,Propensity)
rm(Propensity)
PropensityMax<-tapply(dm2$prop,dm2$grp3,max)
tb<-cbind(tb,PropensityMax)
rm(PropensityMax)
xtable::xtable(tb,digits=c(NA,0,1,1,1,1,1,0,0,0,2,2,2))
addmargins(table(dm2$z,dm2$prop>.15))
#
# Make Table 2 regarding sensitivity analysis
#
gammas<-c(1:5,5.5,6)
ngamma<-length(gammas)
tabSen<-matrix(NA,ngamma,4)
colnames(tabSen)<-c("Pairs 1-1","Sets 1-4","Fisher","Truncated")
rownames(tabSen)<-gammas
for (i in 1:ngamma) tabSen[i,1]<-weightedRank::wgtRank(pd1,phi="u878",gamma=gammas[i])$pval
for (i in 1:ngamma) tabSen[i,2]<-weightedRank::wgtRank(pd4,phi="u878",gamma=gammas[i])$pval
for (i in 1:ngamma) {
if (min(tabSen[i,1:2])==0) tabSen[i,3:4]<-0 # a zero P-value makes the combined P-value zero
else{
tabSen[i,3]<-sensitivitymv::truncatedP(tabSen[i,1:2],trunc=1)
tabSen[i,4]<-sensitivitymv::truncatedP(tabSen[i,1:2],trunc=0.2)
}
}
# Table 2
xtable::xtable(t(tabSen),digits=4)
# Compare Table 2 to a sensitivity analysis for 1425 pairs-only
# by randomly selecting 1 of 4 controls from the 1-to-4 sets
set.seed(12345)
a<-sample(2:5,(dim(pd4)[1]),replace=TRUE)
pd4r<-rep(NA,(dim(pd4)[1]))
for (i in 1:(dim(pd4)[1])) pd4r[i] <- pd4[i,a[i]]
pd4r<-cbind(pd4[,1],pd4r)
rm(a)
weightedRank::wgtRank(rbind(pd1,pd4r),phi="u878",gamma=4.2)
weightedRank::wgtRank(rbind(pd1,pd4r),phi="quade",gamma=4)
weightedRank::wgtRank(rbind(pd1,pd4r),phi="quade",gamma=3)
#
# Make Table 3 regarding counterfactual risk
#
ctab<-table(dm2$pd>=20,dm2$grp3)
ctab<-ctab[2:1,]
ctab<-rbind(ctab,prop.table(ctab,2)[1,]*100)
ctab<-rbind(ctab,c(ctab[1,1]*ctab[2,2]/(ctab[1,2]*ctab[2,1]),
mantelhaen.test(table(dm$pd>=20,dm$z,dm$mset))$estimate,
ctab[1,3]*ctab[2,4]/(ctab[1,4]*ctab[2,3]),
mantelhaen.test(table(dm1$pd>=20,dm1$z,dm1$mset))$estimate))
xtable::xtable(ctab,digits=1)
#
# Evidence factors analysis -- cigarettes per day
#
crosscutplot<-function (x, y, ct = 0.25, xlab = "", ylab = "", main = "",
ylim = NULL)
{
stopifnot(is.vector(x))
stopifnot(is.vector(y))
stopifnot(length(x) == length(y))
stopifnot((ct > 0) & (ct <= 0.5))
qx1 <- stats::quantile(x, ct)
qx2 <- stats::quantile(x, 1 - ct)
qy1 <- stats::quantile(y, ct)
qy2 <- stats::quantile(y, 1 - ct)
use <- ((x <= qx1) | (x >= qx2)) & ((y <= qy1) | (y >= qy2))
if (is.null(ylim))
graphics::plot(x, y, xlab = xlab, ylab = ylab, main = main,
type = "n",las=1,cex.lab=.9,cex.axis=.9,cex.main=.9)
else graphics::plot(x, y, xlab = xlab, ylab = ylab, ylim = ylim,cex.main=.9,
main = main, type = "n",las=1,cex.lab=.9,cex.axis=.9)
graphics::points(x[use], y[use], pch = 16,cex=.6)
graphics::points(x[!use], y[!use], col = "gray", pch = 16,cex=.6)
graphics::abline(h = c(qy1, qy2))
graphics::abline(v = c(qx1, qx2))
}
dCigs1<-dm$cigsperday[dm$z==1]
dCigs4<-dm1$cigsperday[dm1$z==1]
dif1<-pd1[, 1] - pd1[, 2]
dif4<-pd4[,1]-apply(pd4[,2:5],1,median)
par(mfrow=c(1,2))
crosscutplot(dCigs1,dif1,xlab="Cigarettes per Day",ylim=c(-100,100),
ylab="Periodontal Disease",main="1212 Pairs")
text(70,-80,paste("Odds Ratio =",round(89*135/(84*72),2)),cex=.7)
crosscutplot(dCigs4,dif4,xlab="Cigarettes per Day",ylim=c(-100,100),
ylab="Periodontal Disease",
main="213 Matched 1-to-4 Sets")
text(31,-80,paste("Odds Ratio =",round(28*18/(12*9),2)),cex=.7)
DOS2::crosscut(dCigs1,dif1)
DOS2::crosscut(dCigs4,dif4)
tb<-c(as.vector(DOS2::crosscut(dCigs1,dif1)$table),
as.vector(DOS2::crosscut(dCigs4,dif4)$table))
tb<-array(tb,c(2,2,2))
sensitivity2x2xk::mh(tb,Gamma=1.6)
sensitivity2x2xk::mh(tb[,,1],Gamma=1.375)
sensitivity2x2xk::mh(tb[,,2],Gamma=1.7)
par(old.par)
rm(gammas,ngamma,crosscutplot,tb,i,tabSen,pd4r,old.par,ctab)
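The example builds pd1 and pd4 with t(matrix(...)), which relies on the rows being sorted by matched set with the treated individual first. A minimal sketch of that reshaping step on a toy data frame, with made-up values, is:

```r
# Toy long-format data, sorted by matched set with the treated row first
long <- data.frame(mset = rep(1:3, each = 5),
                   z = rep(c(1, 0, 0, 0, 0), 3),
                   pd = c(50, 10, 20, 30, 40,
                          5, 0, 0, 10, 5,
                          80, 60, 70, 20, 10))

# matrix() fills column by column, so each column receives one matched set;
# t() then puts one set in each row, with the treated response in column 1
wide <- t(matrix(long$pd, 5, nrow(long) / 5))
wide

# The first column should hold exactly the treated responses
stopifnot(all(wide[, 1] == long$pd[long$z == 1]))
```

If the rows were not sorted in this way, the columns of the resulting matrix would mix treated and control responses, so a check like the final line is worth running before any analysis.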
Unmatched Periodontal Disease Data
Description
Unmatched data from NHANES 2009-2010, 2011-2012, 2013-2014 concerning smoking and periodontal disease.
Usage
data("PeriUnmatched")
Format
A data frame with 6255 observations on the following 11 variables.
SEQN
NHANES ID number
female
1=female, 0=male
age
Age in years, capped at 80 for confidentiality
ageFloor
Age decade = floor(age/10)
educ
Education as 1 to 5. 1 is less than 9th grade, 2 is at least 9th grade with no high school degree, 3 is a high school degree, 4 is some college, such as a 2-year associate's degree, 5 is at least a 4-year college degree.
noHS
No high school degree. 1 if educ is 1 or 2, 0 if educ is 3 or more
income
Ratio of family income to the poverty level, capped at 5 for confidentiality
nh
The specific NHANES survey. A factor with levels nh0910 < nh1112 < nh1314
cigsperday
Number of cigarettes smoked per day. 0 for nonsmokers.
z
Daily smoker. 1 indicates someone who smokes every day. 0 indicates a never-smoker who smoked fewer than 100 cigarettes in their life.
pd
A percent indicating periodontal disease. See details.
Details
Measurements were made for up to 28 teeth, 14 upper, 14 lower, excluding 4 wisdom teeth. Pocket depth and loss of attachment are two complementary measures of the degree to which the gums have separated from the teeth; see Wei, Barker and Eke (2013). Pocket depth and loss of attachment are measured at six locations on each tooth, provided the tooth is present. A measurement at a location was taken to exhibit disease if it had either a loss of attachment >=4mm or a pocket depth >=4mm, so each tooth contributes six binary scores, up to 6x28=168 binary scores. The variable pd is the percent of these binary scores indicating periodontal disease, 0 to 100 percent.
The data from three NHANES surveys (specifically 2009-2010, 2011-2012, and 2013-2014) contain periodontal data and are used as an example in Rosenbaum (2025). The data from one survey, 2011-2012, were used in Rosenbaum (2016). The example uses these unmatched data twice in artless() to create the fused match in Rosenbaum (2025). The fused match combines some 1-to-1 matched pairs and some 1-to-4 matched sets based on the values of the propensity score. The data are useful in learning about fused matching, but the example in the documentation for artless() should be used as the main example illustrating artless().
Note
An analysis of outcomes should take appropriate account of the matching; see the note in the documentation for PeriMatched. Often, covariate balance is assessed by comparing the marginal distributions of covariates in treated and control groups after matching; however, some care is required when there are both 1-to-1 pairs and 1-to-4 sets. One can assess covariate balance for the pairs, and separately assess covariate balance for the 1-to-4 sets. Alternatively, one can measure covariate balance in the pairs and the 1-to-4 sets separately, perhaps taking the difference in means, and then take a weighted combination of the two differences in means for pairs and 1-to-4 sets, along the lines indicated by Pimentel, Yoon and Keele (2015). However, one cannot assess covariate balance by pooling the two treated groups from pairs and 1-to-4 sets, pooling the two control groups from pairs and 1-to-4 sets, and comparing the two pooled groups. In the example, there is exact matching for sex; however, most pairs are men and most 1-to-4 sets are women. Pool the pairs and the 1-to-4 sets and the pooled control group has proportionately more women than the pooled treated group. To see this, type:
data("PeriMatched")
tapply(PeriMatched$female,PeriMatched$grp3,mean)
tapply(PeriMatched$female,PeriMatched$grp2,mean)
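The weighted combination mentioned in this note can be sketched on simulated data. The weights used here, the share of treated individuals in pairs and in 1-to-4 sets, are an assumption made for illustration and may differ from the weighting in Pimentel, Yoon and Keele (2015); the data frame toy and the helper diff_means() are hypothetical.

```r
# Simulated matched data: pair is 1 for pairs and 0 for 1-to-4 sets
set.seed(2)
n_pairs <- 200
n_sets <- 60
toy <- data.frame(
  pair = rep(c(1, 0), c(2 * n_pairs, 5 * n_sets)),
  z = c(rep(c(1, 0), n_pairs), rep(c(1, 0, 0, 0, 0), n_sets)))
toy$age <- rnorm(nrow(toy), mean = ifelse(toy$pair == 1, 45, 55), sd = 10)

# Difference in means, treated minus control, within a data frame
diff_means <- function(d) mean(d$age[d$z == 1]) - mean(d$age[d$z == 0])
d_pairs <- diff_means(toy[toy$pair == 1, ])
d_sets <- diff_means(toy[toy$pair == 0, ])

# Assumed weights: share of treated individuals in each type of matched set
w <- c(n_pairs, n_sets) / (n_pairs + n_sets)
combined <- sum(w * c(d_pairs, d_sets))

# The pooled comparison that the note warns against, shown for contrast
pooled <- diff_means(toy)
c(pairs = d_pairs, sets = d_sets, combined = combined, pooled = pooled)
```

In this simulation age is unrelated to treatment within each type, so the two separate differences and their weighted combination are near zero, while the pooled difference is not, because controls from 1-to-4 sets are overrepresented in the pooled control group.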
Source
US National Health and Nutrition Examination Survey (NHANES). https://www.cdc.gov/nchs/nhanes/
References
Pimentel, S. D., Yoon, F., & Keele, L. (2015) <doi:10.1002/sim.6593> Variable‐ratio matching with fine balance in a study of the Peer Health Exchange. Statistics in Medicine, 34(30), 4070-4082.
Rosenbaum, P. R. (2016) <doi:10.1214/16-AOAS942> Using Scheffe projections for multiple outcomes in an observational study of smoking and periodontal disease. Annals of Applied Statistics, 10, 1447-1471.
Rosenbaum, Paul R. (2025) A Design for Observational Studies in Which Some People Avoid Treatment. Manuscript.
Tomar, S. L. and Asma, S. (2000). Smoking attributable periodontitis in the United States: Findings from NHANES III. J. Periodont. 71, 743-751.
Wei, L., Barker, L. and Eke, P. (2013). Array applications in determining periodontal disease measurement. SouthEast SAS User's Group. (SESUG2013) Paper CC-15, analytics.ncsu.edu/sesug/2013/CC-15.pdf.
Examples
# The code below creates the matched data, PeriMatched, from the unmatched
# data PeriUnmatched using the function artless() twice. Individuals
# with prop above 0.15 were matched in pairs. Individuals with prop of at
# most 0.15 were matched in a 1-to-4 ratio.
data(PeriUnmatched)
# Controls matched for female, age, education, income
d0<-PeriUnmatched
prop<-stats::glm(d0$z~d0$female+d0$age+d0$educ+d0$income,family=binomial)$fitted
d0<-cbind(d0,prop)
rm(prop)
# Pair match for higher propensity individuals
d1<-d0[d0$prop>0.15,]
attach(d1)
ageFloor<-floor(age/10)
lowInc<-1*(income<2)
highInc<-1*(income>=4)
x<-cbind(female,age,educ,income)
xm<-cbind(age,educ,income)
near<-cbind(female,ageFloor)
age60<-1*(age>=60)
fine<-cbind(age60,noHS,lowInc,highInc,female)
# Match does the following: estimates a new propensity score in
# this subpopulation using the covariates in x, uses a
# Mahalanobis distance for the covariates in xm, performs near-exact
# matching for the covariates in near, and performs near-fine balancing
# of the covariates in fine. The solver rlemon is used because it is
# available in R, but rrelaxiv may be a better choice, though it
# requires a separate installation.
m<-artless(d1,z,x,xm=xm,near=near,fine=fine,solver="rlemon")
detach(d1)
# Some clean-up follows
rm(age60)
dm<-m$match
dm<-dm[!is.na(dm$mset),]
rm(x,xm,fine,near,d1,ageFloor,lowInc,highInc)
treated<-as.vector(rbind(dm$SEQN[dm$z==1],dm$SEQN[dm$z==1]))
dm<-cbind(dm,treated)
rm(treated)
# Now match 1-to-4 for low propensity individuals
d1<-d0[d0$prop<=0.15,]
attach(d1)
ageFloor<-floor(age/10)
lowInc<-1*(income<2)
highInc<-1*(income>=4)
x<-cbind(female,age,educ,income)
xm<-cbind(age,educ,income)
near<-cbind(female,ageFloor)
age60<-1*(age>=60)
fine<-cbind(age60,noHS,lowInc,highInc,female)
ncontrols<-4
# Match does the following: estimates a new propensity score in
# this subpopulation using the covariates in x, uses a
# Mahalanobis distance for the covariates in xm, performs near-exact
# matching for the covariates in near, and performs near-fine balancing
# of the covariates in fine. The solver rlemon is used because it is
# available in R, but rrelaxiv may be a better choice, though it
# requires a separate installation.
m1<-artless(d1,z,x,xm=xm,near=near,fine=fine,solver="rlemon",
ncontrols=ncontrols)
detach(d1)
# Some clean-up follows
rm(age60)
dm1<-m1$match
dm1<-dm1[!is.na(dm1$mset),]
rm(x,xm,fine,near,d1,ageFloor,lowInc,highInc)
treated1<-dm1$SEQN[dm1$z==1]
treated<-treated1
for (i in 1:(ncontrols)) treated<-rbind(treated,treated1)
treated<-as.vector(treated)
dm1<-cbind(dm1,treated)
rm(treated,treated1,i,ncontrols)
# Pool the two matched samples into one data.frame dm2
pair<-rep(1,dim(dm)[1])
dm<-cbind(dm,pair)
dm$mset<-as.integer(dm$mset)
pair<-rep(0,dim(dm1)[1])
dm1<-cbind(dm1,pair)
dm1$mset<-as.integer(dm1$mset)+max(dm$mset)
dm2<-rbind(dm1,dm)
rm(pair)
grp2<-factor(dm2$z,levels=1:0,labels=c("S","N"),ordered=TRUE)
grp3<-factor(dm2$pair,levels=c(1,0),labels=c("1-1","1-4"),ordered=TRUE):grp2
dm2<-cbind(dm2,grp2,grp3)
rm(grp2,grp3)
# There are 1212 pairs and 213 1-to-4 sets
table(table(dm2$mset))
# Check the balance tables separately for pairs and sets
# Pairs
m$balance
# 1-to-4 sets
m1$balance
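The first step of this example, fitting a propensity score and splitting the sample at 0.15, can be tried on simulated data without calling artless(). Everything in the sketch below (the data frame sim, its covariates, and the coefficients used to simulate treatment) is made up for illustration; only the cutoff 0.15 is taken from the example.

```r
set.seed(3)
n <- 500
sim <- data.frame(female = rbinom(n, 1, 0.5),
                  age = round(runif(n, 30, 80)),
                  educ = sample(1:5, n, replace = TRUE))
# Simulated treatment: less likely for women and for people with more education
sim$z <- rbinom(n, 1, plogis(0.5 - 1.2 * sim$female - 0.4 * sim$educ))

# Fit a linear logit model and keep the fitted probabilities
fit <- stats::glm(z ~ female + age + educ, family = binomial, data = sim)
sim$prop <- fit$fitted.values

# Split at the cutoff used in the example
high <- sim[sim$prop > 0.15, ]
low <- sim[sim$prop <= 0.15, ]

# A 1-to-4 match needs at least four controls for every treated individual
c(treated = sum(low$z), controls = sum(1 - low$z))
```

Counting treated and control individuals in each part before matching gives an early indication of whether the intended matching ratio is feasible there.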
Artless Automatic Matching
Description
Implements a simple version of multivariate matching using a propensity score, near-exact matching, near-fine balance, and robust Mahalanobis distance matching. You specify the variables, and the program does everything else. Should you be artful, not artless? See the notes.
Usage
artless(dat, z, x, xm = NULL, near = NULL, fine = NULL,
ncontrols = 1, rnd = 2, solver="rlemon")
Arguments
dat |
A dataframe containing the data set that will be matched. Let N be the number of rows of dat. |
z |
A binary vector of length N where z[i]=1 if the ith row of dat describes a treated individual and z[i]=0 if the ith row of dat describes a control. |
x |
x is a numeric matrix with N rows. The covariates in x are used to estimate a propensity score using a linear logit model. |
xm |
xm is a numeric matrix with N rows. The covariates in xm are used to define a robust Mahalanobis distance between treated and control individuals. |
near |
A numeric vector of length N or a numeric matrix with N rows. Each column of near should represent levels of a nominal covariate with two or a few levels. The variables in near are used in near-exact matching. |
fine |
A numeric vector of length N or a numeric matrix with N rows. Each column of fine should represent levels of a nominal covariate with two or a few levels. The variables in fine are used in near-fine balancing. |
ncontrols |
A positive integer. ncontrols is the number of controls to be matched to each treated individual. |
rnd |
A nonnegative integer. The balance table is rounded for display to rnd digits. |
solver |
Either "rlemon" or "rrelaxiv". The rlemon solver is automatically available without special installation. The rrelaxiv requires a special installation. See the note. |
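Before calling artless(), it can help to confirm that the inputs have the shapes described above. The function check_inputs() below is a hypothetical helper written for this sketch and is not part of aamatch; its last line adds one further assumption, namely that each control is used at most once, so that there must be at least ncontrols controls per treated individual.

```r
# Hypothetical helper, not part of aamatch: checks the documented input shapes
check_inputs <- function(dat, z, x, xm = NULL, near = NULL, fine = NULL,
                         ncontrols = 1) {
  N <- nrow(dat)
  # Number of rows of a vector or matrix; an omitted argument passes trivially
  rows <- function(a) if (is.null(a)) N else NROW(a)
  stopifnot(is.data.frame(dat),
            length(z) == N, all(z %in% c(0, 1)),
            is.matrix(x), is.numeric(x),
            rows(x) == N, rows(xm) == N, rows(near) == N, rows(fine) == N,
            ncontrols >= 1, ncontrols == round(ncontrols))
  # Assumed requirement: enough controls to give each treated person ncontrols
  sum(z == 0) >= ncontrols * sum(z == 1)
}

dat <- data.frame(z = c(1, 0, 0, 1, 0, 0),
                  age = c(30, 32, 51, 60, 58, 44),
                  female = c(1, 1, 0, 0, 0, 1))
x <- cbind(age = dat$age, female = dat$female)
check_inputs(dat, dat$z, x, xm = x[, "age", drop = FALSE], near = dat$female)
check_inputs(dat, dat$z, x, ncontrols = 4)
```

With two treated individuals and four controls, the first call returns TRUE and the second returns FALSE, since a 1-to-4 match would need eight controls.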
Details
This package builds a matched treated-control sample from an unmatched data set. It asks you to designate roles for specific covariates, and it does the rest. It is described as “artless automatic matching” because it makes decisions by default. Perhaps you could make better decisions; if so, perhaps try the iTOS package which gives you much more control over decisions. The package will often create a reasonable matched sample with little effort; however, it also could be used as a first step in learning the art of constructing a matched sample. Wittgenstein spoke of the “ladder you throw away after you have climbed it,” and the package can also serve that function.
Value
match |
A dataframe containing the matched data set. match contains the rows of dat in a different order. match adds two columns to dat, called mset and matched, which identify matched pairs or matched sets. Specifically, matched is TRUE if a row is in the matched sample and is FALSE otherwise. Rows of dat that are in the same matched set have the same value of mset. The rows of match are sorted by mset with the treated individual before the matched controls. The unmatched controls with matched=FALSE appear as the last rows of match. match also adds the estimated propensity score as a probability pr. When you analyze the matched data, you will want to remove rows of match with matched==FALSE. |
balance |
A matrix called the balance table. The matrix has one row for each covariate in x, xm, near and fine; so, some covariates may be repeated. It also has a first row for the propensity score. There are five columns. Column 1 is the mean of the covariate in the treated group. Column 2 is the mean of the covariate in the matched control group. Column 3 is the mean of the covariate among all controls prior to matching. Column 4 is the difference between columns 1 and 2 divided by a pooled estimate of the standard deviation of the covariate before matching. Column 5 is the difference between columns 1 and 3 divided by a pooled estimate of the standard deviation of the covariate before matching. Notice that columns 4 and 5 have the same denominator, but different numerators. |
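The five columns of one row of the balance table can be sketched by hand for a single simulated covariate. The pooled standard deviation is taken here to be the square root of the average of the treated and control variances before matching; that formula, the column names, and the pretend matched sample are assumptions of this sketch, not a statement of what artless() computes internally.

```r
set.seed(4)
# Simulated covariate: 50 treated, 300 controls. For illustration only,
# pretend the first 50 controls are the matched controls.
z <- rep(c(1, 0), c(50, 300))
age <- c(rnorm(50, 55, 8), rnorm(50, 54, 8), rnorm(250, 45, 10))
matched <- c(rep(TRUE, 100), rep(FALSE, 250))

m_treated <- mean(age[z == 1])
m_matched <- mean(age[z == 0 & matched])
m_all <- mean(age[z == 0])

# Assumed pooled SD, computed from all treated and all controls before matching
s_pool <- sqrt((var(age[z == 1]) + var(age[z == 0])) / 2)

# Columns 4 and 5 share the denominator s_pool but have different numerators
balance_row <- c(treated = m_treated,
                 matched_control = m_matched,
                 all_control = m_all,
                 std_dif_after = (m_treated - m_matched) / s_pool,
                 std_dif_before = (m_treated - m_all) / s_pool)
round(balance_row, 2)
```

Because the denominator is the same before and after matching, a smaller value in column 4 than in column 5 reflects a smaller difference in means, not a change in the standard deviation.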
Note
– The following are some practical tips on how to use artless.
– Placing a covariate in x means that it is included in the propensity score. Most or all covariates that you want to balance should be placed in x.
– A limited number of nominal covariates with a few levels can be placed in near or in fine. Both near and fine covariates are given overriding importance; so, if you place too many covariates in near or fine, or if they have too many levels, they will override everything else, and the match quality will be poor. The same covariate can appear, perhaps in different forms, in x, xm, near and fine. In the example, a five-level education variable is in x and xm, and a two-level education variable formed from the five-level education variable is in fine.
– An attempt is made to exactly match for covariates in near. In the example, near contains two binary covariates, namely female and dontSmoke. This means that the match will try whenever possible to match women to women and men to men, nonsmokers to nonsmokers, and smokers to smokers. Other considerations are subordinated to this goal.
– An attempt is made to balance covariates in fine. In the example, fine includes a covariate expressing four broad age categories, one low education category (less than high school), and a binary covariate distinguishing daily-smokers from everyone else. This means that the match will work hard to have the same proportion of people with less-than-high-school education in treated and control groups, but it will not prioritize pairing two people with less-than-high-school education. Although subordinate to near-exact matching, fine balance is given more importance than other considerations.
– Two separate attempts are made for the propensity score: first, to balance it in the sense of fine balance, and second, to pair closely for it. More emphasis is given to balancing the propensity score, much less to pairing for it. The match also tries in a limited way to avoid using many controls whose propensity scores are below the minimum propensity score in the treated group.
– An attempt is made to pair closely for covariates in xm; however, this task has the lowest priority of the several goals. A continuous covariate, like age or bmi, might be placed in x and in xm. Covariates in xm are given roughly equal importance, so do not put unimportant covariates in xm.
– The covariates in x could include, say: (i) a quadratic in age, (age-mean(age))^2, (ii) an interaction, (age-mean(age))*(bmi-mean(bmi)), or (iii) spline terms computed from age. The propensity score is fitted as a linear logit model in the covariates in x, but you can fit various nonlinear propensity scores by passing in x various nonlinear transformations of a more limited set of covariates.
– Usually, the first match you construct is imperfect, and you see this in the balance table or in plots of the matched data. So, you make small adjustments to x, xm, near and fine to fix the imperfections. The match should be finalized before any outcome information is examined. Taking the first match without looking at it and improving it is not artless; it is incompetent.
– Once you have developed some experience with the artless function, you may want to learn about other artful tactics that can enhance your ability to remove imperfections in a match. Some of these tactics are implemented in the iTOS package that is called by artless.
– There are treated and control groups that cannot be matched. If all of the treated individuals are under age 20 and all of the controls are over age 50, then there is no way you can match for age. You could do regression or covariance adjustment for age, but of course it would be silly. Matching will often stop you from doing silly things, while regression will let you do silly things.
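The tip about nonlinear terms in x can be made concrete with a short sketch. The covariates age and bmi are simulated, and the names x_poly and x_spline are hypothetical; the two matrices are alternatives, since a spline basis in age already contains the linear age term and should not be combined with it.

```r
set.seed(5)
n <- 200
age <- round(runif(n, 20, 80))
bmi <- rnorm(n, 27, 4)

# Center before squaring or multiplying, as in the tip above
age_c <- age - mean(age)
bmi_c <- bmi - mean(bmi)

# Alternative 1: quadratic in age and an age-by-bmi interaction
x_poly <- cbind(age, bmi, age2 = age_c^2, age_bmi = age_c * bmi_c)

# Alternative 2: natural spline terms in age, from the splines package
# that is distributed with R
x_spline <- cbind(splines::ns(age, df = 3), bmi)

dim(x_poly)
dim(x_spline)
```

Either matrix could then be passed as x, so that the linear logit model in these columns becomes a nonlinear propensity score in age and bmi.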
Note
Should you be artful rather than artless? Essentially, the artless() function is setting priorities by default. This makes artless() easy to use, but its default priorities might not be your priorities. An alternative is to set your own priorities by using the matching methods in, say, the iTOS package. The artless() function calls the functions in the iTOS package, but it sets default priorities when it does this. There are also many more options in the iTOS package.
What can artful use of iTOS do that artless() cannot? artless() automatically sets priorities and penalties, but iTOS lets you adjust them. artless() automatically gives an emphasis to the propensity score, and does this in a particular way, but iTOS lets you decide. The directional penalties of Yu and Rosenbaum (2019) need to be titrated to produce desired effects; they are in iTOS but not in artless(). Near-exact and near-fine matching are implemented for nominal variables in artless(), but iTOS has other options for ordered categories. iTOS lets you give more emphasis to one covariate, less to another, but artless() does this only indirectly through the matrices x, xm, near and fine. In artless() all variables in near are treated as equally important, and all variables in fine are treated as equally important, but iTOS lets you decide. Caliper matching is possible in iTOS but not in artless(). artless() uses the control-control edge costs in Zhang et al. (2023) to avoid low propensity scores in the control group, but iTOS lets you use this feature any way you prefer. The iTOS package is associated with Rosenbaum (2025), especially its Chapters 5 and 6.
Note
This note provides some references and detail about what the package is actually doing. You do not have to read this note to use the package.
Matching using propensity scores and a Mahalanobis distance is discussed in Rosenbaum and Rubin (1985). The robust Mahalanobis distance is discussed in Section 9.3 of Rosenbaum (2020a) and more briefly in Section 4.1 of Rosenbaum (2020b).
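The idea behind a rank-based, robust Mahalanobis distance can be sketched in a few lines of base R. This is a simplified illustration only, not the code used by artless() or iTOS: the helper robust_mahal() and the toy data are hypothetical, and the sketch assumes that no covariate is constant and that the rank covariance matrix is invertible.

```r
# Simplified sketch of a rank-based (robust) Mahalanobis distance.
# robust_mahal() is a hypothetical helper, not part of aamatch or iTOS.
robust_mahal <- function(z, X) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Replace each covariate by its ranks to limit the influence of outliers
  R <- apply(X, 2, rank)
  cv <- cov(R)
  # Rescale so each variance equals that of the untied ranks 1..n,
  # which keeps heavily tied covariates from receiving extra weight
  rat <- sqrt(var(1:n) / diag(cv))
  D <- diag(rat, nrow = length(rat))
  icov <- solve(D %*% cv %*% D)
  Rt <- R[z == 1, , drop = FALSE]
  Rc <- R[z == 0, , drop = FALSE]
  # Rows index treated units, columns index controls
  out <- t(apply(Rt, 1, function(r) mahalanobis(Rc, r, icov, inverted = TRUE)))
  dimnames(out) <- list(which(z == 1), which(z == 0))
  out
}

# Toy data: individual 6 has an extreme bmi
z <- c(1, 1, 0, 0, 0, 0)
X <- cbind(age = c(30, 62, 31, 45, 60, 58), bmi = c(22, 35, 24, 28, 33, 90))
d <- robust_mahal(z, X)
round(d, 2)

# Shrinking the outlier does not change the ranks, so the distances are unchanged
X2 <- X
X2[6, "bmi"] <- 40
all.equal(d, robust_mahal(z, X2))
```

Because only the ranks enter the distance, a single wild value cannot inflate the variance of a covariate and thereby shrink every distance involving that covariate.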
Near-exact matching (also known as almost-exact matching) is an attempt to match exactly for a few nominal covariates, while also matching for other things. It is described in Sections 10.3 and 10.4 of Rosenbaum (2020a) and more briefly in Section 4.3 of Rosenbaum (2020b). Near-exact matching is implemented by a large penalty added to a covariate distance: if two people are not exactly matched for a near-exact covariate, then the covariate distance between them is very large. Near-exact matching minimizes the number of individuals who are not exactly matched.
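The penalty idea can be illustrated with a toy distance matrix in base R. The covariate values, the distances and the penalty of 1000 are all made up for illustration; they are not the values that artless() uses.

```r
# Toy illustration of a near-exact penalty; all values are hypothetical.
female_t <- c(1, 0)          # two treated individuals
female_c <- c(1, 1, 0, 0)    # four potential controls
# A covariate distance: treated in rows, controls in columns
dist <- matrix(c(0.2, 1.5, 0.1, 0.9,
                 0.3, 0.4, 2.0, 0.6), nrow = 2, byrow = TRUE)
penalty <- 1000              # large relative to every entry of dist
# TRUE where a treated individual and a control differ on female
mismatch <- outer(female_t, female_c, "!=")
dist_pen <- dist + penalty * mismatch
dist_pen
```

Before the penalty, the closest control for the first treated individual is control 3, who differs on female. After the penalty, every mismatched pairing costs at least 1000, so an optimal match will use a mismatched pair only when no exactly matched alternative is available.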
Fine balance attempts to balance a covariate without pairing for it. For example, female is balanced if the treated and control groups have the same proportion of females, but female is exactly matched if females are always matched to females. Fine balance is discussed in Chapter 11 of Rosenbaum (2020a) and more briefly in Section 4.4 of Rosenbaum (2020b). Fine balance was introduced in Section 3.2 of Rosenbaum (1989), and is further developed in Rosenbaum, Ross and Silber (2007). If one seeks a match as close as possible to fine balance, then one is doing near-fine balance. Near-fine balance is often implemented using penalties for imbalances; see Yang et al. (2012), Pimentel et al. (2015) and Zhang et al. (2023).
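The distinction between balancing a covariate and pairing for it can be seen in a toy pair match, written here in base R with made-up data.

```r
# Toy pair match with four pairs; the data are hypothetical.
pairs <- data.frame(
  female_treated = c(1, 1, 0, 0),
  female_control = c(1, 0, 1, 0)
)
# Fine balance concerns the marginal proportions in the two groups
prop_treated <- mean(pairs$female_treated)
prop_control <- mean(pairs$female_control)
# Exact matching concerns agreement within each pair
prop_agree <- mean(pairs$female_treated == pairs$female_control)
c(treated = prop_treated, control = prop_control, agree = prop_agree)
```

Here female is finely balanced, because half of each group is female, yet only half of the pairs are exactly matched for female.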
One can do near-exact matching and fine balancing of the same variable, perhaps leading the proportion of females to be exactly the same in treated and control groups, with pairs matched for female as often as is possible. See Zubizarreta et al. (2011) for discussion.
artless() uses the control-control edge costs in Zhang et al. (2023) to moderately penalize the use of a control whose propensity score is below the minimum propensity score in the treated group. This penalty is smaller than the penalty for near-exact matching and for aspects of propensity score balancing, but it is larger than the penalty for each variable in near-fine matching.
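To see which controls such a penalty would affect, one can compare each control's propensity score with the smallest propensity score among the treated. The sketch below uses made-up scores and base R only; it identifies the affected controls, and does not reproduce how artless() places the penalty in the network.

```r
# Hypothetical estimated propensity scores and treatment indicator
ps <- c(0.60, 0.35, 0.50, 0.10, 0.40, 0.20, 0.55)
z  <- c(1, 1, 1, 0, 0, 0, 0)
# Smallest propensity score among treated individuals
min_treated <- min(ps[z == 1])
# Controls whose propensity score falls below that minimum
below <- (z == 0) & (ps < min_treated)
which(below)
```

In this toy example, individuals 4 and 6 are controls with scores below every treated score, so they are the ones whose use would be moderately discouraged.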
This package implements a very specific version of two-criteria matching from Zhang et al. (2023) using functions from the iTOS package. Two-criteria matching integrates a number of earlier techniques into a single network structure. The package picks several one-size-fits-all penalties for distances for two-criteria matching. An artful match might vary penalties in a thoughtful way to achieve a better, closer, more balanced match with a larger value of ncontrols. The package does not use asymmetric calipers and directional penalties from Yu and Rosenbaum (2019) because these are not easily automated, but the artful use of these techniques can produce a better match.
The package uses optimal matching by minimum cost flow in a network. See Bertsekas (1990) for an introduction to this optimization technique, and see Rosenbaum (1989) for its application to matching in observational studies.
The package indirectly uses the callrelax() function in Samuel Pimentel's rcbalance package. This function was originally intended to call the excellent RELAXIV Fortran code of Bertsekas and Tseng (1988, 1994). Unfortunately, that code has an academic license and is not available from CRAN; so, by default, callrelax() uses the rlemon package instead, which is available from CRAN. If you qualify as an academic, then you may be able to download the RELAXIV code from GitHub at <https://github.com/josherrickson/rrelaxiv/> and use it in artless() by setting solver="rrelaxiv".
artless() uses a dense network, so it can match moderately large data sets, but not very large data sets. For very large data sets, see Yu et al. (2020) and Yu's bigmatch package in R.
Network optimization is only one of several optimization techniques that may be used in multivariate matching. See Niknam and Zubizarreta (2022), Zubizarreta (2012) and Rosenbaum and Zubizarreta (2023).
Author(s)
Paul R. Rosenbaum
References
Bertsekas, D. P., Tseng, P. (1988) <doi:10.1007/BF02288322> The Relax codes for linear minimum cost network flow problems. Annals of Operations Research, 13, 125-190.
Bertsekas, D. P. (1990) <doi:10.1287/inte.20.4.133> The auction algorithm for assignment and other network flow problems: A tutorial. Interfaces, 20(4), 133-149.
Bertsekas, D. P., Tseng, P. (1994) <http://web.mit.edu/dimitrib/www/Bertsekas_Tseng_RELAX4_!994.pdf> RELAX-IV: A Faster Version of the RELAX Code for Solving Minimum Cost Flow Problems.
Greifer, N. and Stuart, E.A., (2021). <doi:10.1093/epirev/mxab003> Matching methods for confounder adjustment: an addition to the epidemiologist’s toolbox. Epidemiologic Reviews, 43(1), pp.118-129.
Hansen, B. B. and Klopfer, S. O. (2006) <doi:10.1198/106186006X137047> Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15(3), 609-627. ('optmatch' package)
Hansen, B. B. (2007) <https://www.r-project.org/conferences/useR-2007/program/presentations/hansen.pdf> Flexible, optimal matching for observational studies. R News, 7, 18-24. ('optmatch' package)
Niknam, B. A. and Zubizarreta, J. R. (2022) <doi:10.1001/jama.2021.20555> Using cardinality matching to design balanced and representative samples for observational studies. JAMA, 327(2), 173-174.
Pimentel, S. D., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2015) <doi:10.1080/01621459.2014.997879> Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110, 515-527.
Pimentel, S. D., Yoon, F. and Keele, L. (2015) <doi:10.1002/sim.6593> Variable-ratio matching with fine balance in a study of the Peer Health Exchange. Statistics in Medicine, 34(30), 4070-4082.
Rosenbaum, P. R. and Rubin, D. B. (1985) <doi:10.1080/00031305.1985.10479383> Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39, 33-38.
Rosenbaum, P. R. (1989) <doi:10.1080/01621459.1989.10478868> Optimal matching for observational studies. Journal of the American Statistical Association, 84(408), 1024-1032.
Rosenbaum, P. R., Ross, R. N. and Silber, J. H. (2007) <doi:10.1198/016214506000001059> Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. Journal of the American Statistical Association, 102, 75-83.
Rosenbaum, P. R. (2020a) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies (2nd Edition). New York: Springer.
Rosenbaum, P. R. (2020b). <doi:10.1146/annurev-statistics-031219-041058> Modern algorithms for matching in observational studies. Annual Review of Statistics and Its Application, 7(1), 143-176.
Rosenbaum, P. R. and Zubizarreta, J. R. (2023) <doi:10.1201/9781003102670> Optimization Techniques in Multivariate Matching. Handbook of Matching and Weighting Adjustments for Causal Inference, pp. 63-86. Boca Raton, FL: Chapman and Hall/CRC Press.
Rosenbaum, P. R. (2025) Introduction to the Theory of Observational Studies. New York: Springer.
Rubin, D. B. (1980) <doi:10.2307/2529981> Bias reduction using Mahalanobis-metric matching. Biometrics, 36, 293-298.
Stuart, E.A., (2010). <doi:10.1214/09-STS313> Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1-21.
Yang, D., Small, D. S., Silber, J. H. and Rosenbaum, P. R. (2012) <doi:10.1111/j.1541-0420.2011.01691.x> Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics, 68, 628-636.
Yu, R. and Rosenbaum, P. R. (2019) <doi:10.1111/biom.13098> Directional penalties for optimal matching in observational studies. Biometrics, 75(4), 1380-1390.
Yu, R., Silber, J. H., & Rosenbaum, P. R. (2020) <doi:10.1214/19-STS699> Matching methods for observational studies derived from large administrative databases. Statistical Science, 35(3), 338-355.
Yu, R. (2021) <doi:10.1111/biom.13374> Evaluating and improving a matched comparison of antidepressants and bone density. Biometrics, 77(4), 1276-1288.
Yu, R. (2023) <doi:10.1111/biom.13771> How well can fine balance work for covariate balancing? Biometrics. 79(3), 2346-2356.
Zhang, B., D. S. Small, K. B. Lasater, M. McHugh, J. H. Silber, and P. R. Rosenbaum (2023) <doi:10.1080/01621459.2021.1981337> Matching one sample according to two criteria in observational studies. Journal of the American Statistical Association, 118, 1140-1151.
Zubizarreta, J. R. (2012) <doi:10.1080/01621459.2012.703874> Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500), 1360-1371.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2011) <doi:10.1198/tas.2011.11072> Matching for several sparse nominal variables in a case control study of readmission following surgery. The American Statistician, 65(4), 229-238.
Zubizarreta, J. R., Stuart, E. A., Small, D. S. and Rosenbaum, P. R., eds. (2023) <doi:10.1201/9781003102670> Handbook of Matching and Weighting Adjustments for Causal Inference. Boca Raton, FL: Chapman and Hall/CRC Press.
Examples
# The example below uses the binge data from the iTOS package.
# See the documentation for binge in the iTOS package for more information.
#
library(iTOS)
data(binge)
b2<-binge[binge$AlcGroup!="P",] # Match binge drinkers to nondrinkers
z<-1*(b2$AlcGroup=="B") # Treatment/control indicator
b2<-cbind(b2,z)
rm(z)
rownames(b2)<-b2$SEQN
attach(b2)
#
agec<-as.integer(ageC)
#
# x contains the variables in the propensity score
#
x<-data.frame(age,female,education,bmi,vigor,smokenow,smokeQuit,bpRX)
#
# Create nominal covariates to include in near or fine
#
smoke<-1*(smokenow==1)
dontSmoke<-1*(smokenow==3)
age50<-1*(age>=50)
bmi30<-1*(bmi>=30)
ed2<-1*(education<=2)
#
# near contains covariates to be matched as exactly as possible
#
near<-cbind(female,dontSmoke)
#
# xm contains covariates in the robust Mahalanobis distance
# Includes some continuous covariates.
#
xm<-cbind(age,bmi,vigor,smokenow,education)
#
# fine contains covariates that will be balanced, but not matched
#
fine<-cbind(ageC,ed2,smoke,dontSmoke)
rm(agec,bmi30,smoke,ed2,age50)
detach(b2)
mc<-artless(b2,b2$z,x,xm=xm,near=near,fine=fine,ncontrols=3)
#
# Here are the first two 1-to-3 matched sets.
#
mc$match[1:8,]
#
# You can check that every matched set is exactly matched for
# female and nonsmoking. This is from near-exact matching.
# In some other data set, the number of mismatches might be
# minimized, not driven to zero.
#
# The balance table shows that large imbalances in covariates
# existed before matching, but are much smaller after matching.
# Look, for example, at the propensity score, female, and
# the several versions of the smoking variable.
#
mc$balance