Title: | List Balancing for Reweighting and Population Synthesis |
Version: | 1.0.2 |
Description: | Performs iterative proportional updating given a seed table and an arbitrary number of marginal distributions. This is commonly used in population synthesis, survey raking, matrix rebalancing, and other applications. For example, a household survey may be weighted to match the known distribution of households by size from the census. An origin/destination trip matrix might be balanced to match traffic counts. The approach used by this package is based on a paper from Arizona State University (Ye, Xin, et al. (2009) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf). Some enhancements have been made to their work, including primary and secondary target balance/importance, general marginal agreement, and weight restriction. |
License: | Apache License (== 2.0) |
URL: | https://github.com/dkyleward/ipfr |
BugReports: | https://github.com/dkyleward/ipfr/issues |
Depends: | R (≥ 3.2.0) |
Imports: | dplyr (≥ 0.7.3), ggplot2 (≥ 2.2.1), magrittr (≥ 1.5), tidyr (≥ 0.5.1), mlr (≥ 2.11) |
LazyData: | true |
Suggests: | knitr, rmarkdown, testthat (≥ 2.1.0), covr |
VignetteBuilder: | knitr |
RoxygenNote: | 7.0.2 |
NeedsCompilation: | no |
Packaged: | 2020-04-01 19:42:58 UTC; kyle |
Author: | Kyle Ward [aut, cre, cph], Greg Macfarlane [ctb] |
Maintainer: | Kyle Ward <kyleward084@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2020-04-01 20:20:02 UTC |
ipfr: A package to perform iterative proportional fitting
Description
The main function is ipu. For a 2D/matrix problem, the ipu_matrix function is easier to use. The resulting weight_tbl from ipu() can be fed into synthesize to generate a synthetic population.
Author(s)
Maintainer: Kyle Ward kyleward084@gmail.com [copyright holder]
Other contributors:
Greg Macfarlane gregmacfarlane@byu.edu [contributor]
See Also
Useful links:
https://github.com/dkyleward/ipfr
Report bugs at https://github.com/dkyleward/ipfr/issues
Applies an importance weight to an ipfr factor
Description
At lower values of importance, the factor is moved closer to 1.
Usage
adjust_factor(factor, importance)
Arguments
factor |
A correction factor that is calculated using target/current. |
importance |
A |
Value
The adjusted factor.
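The effect described above can be illustrated with a minimal sketch. It assumes that importance lies between 0 and 1 and that the factor is pulled toward 1 linearly; the function name adjust_factor_sketch is hypothetical, and the package's own formula may differ.

```r
# Hypothetical illustration, not the package's implementation.
# Assumes importance is in [0, 1] and the damping is linear.
adjust_factor_sketch <- function(factor, importance) {
  stopifnot(importance >= 0, importance <= 1)
  # importance = 1 keeps the factor; importance = 0 returns 1 (no correction)
  1 + (factor - 1) * importance
}

full <- adjust_factor_sketch(2, 1)      # factor is unchanged
half <- adjust_factor_sketch(2, 0.5)    # factor moves halfway toward 1
none <- adjust_factor_sketch(0.5, 0)    # factor is fully relaxed to 1
```

With this form, a correction factor of 2 (target twice the current value) is applied in full at importance 1 and only half as strongly at importance 0.5.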
Balances secondary targets to primary
Description
The average weight per record needed to satisfy targets is computed for both primary and secondary targets. Often, these can be very different, which leads to poor performance. The algorithm must use extremely large or small weights to match the competing goals. The secondary targets are scaled so that they are consistent with the primary targets on this measurement.
Usage
balance_secondary_targets(
primary_targets,
primary_seed,
secondary_targets,
secondary_seed,
secondary_importance,
primary_id
)
Arguments
primary_targets |
A |
primary_seed |
In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair. |
secondary_targets |
Same format as |
secondary_seed |
Most commonly, if the primary_seed describes
households, the secondary seed table would describe the persons in each
household. Must contain the same |
secondary_importance |
A |
primary_id |
The field used to join the primary and secondary seed
tables. Only necessary if |
Details
If multiple geographies are present in the secondary_target table, then balancing is done for each geography separately.
Value
A named list of the secondary targets.
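The "average weight per record" comparison from the Description can be sketched as follows. This is a simplified illustration with made-up totals and a single geography; it ignores secondary_importance, and all variable names are hypothetical rather than taken from the package.

```r
# Hypothetical seed tables: 4 households containing 6 persons in total.
hh_seed <- data.frame(id = 1:4)
per_seed <- data.frame(id = c(1, 2, 2, 3, 3, 4))

# Hypothetical control totals that imply different weights per record.
hh_target_total <- 400
per_target_total <- 1200

# Average weight each record would need in order to hit its own total.
primary_avg <- hh_target_total / nrow(hh_seed)
secondary_avg <- per_target_total / nrow(per_seed)

# Scale the secondary total so both levels imply the same average weight.
scale <- primary_avg / secondary_avg
balanced_per_total <- per_target_total * scale
```

Here households need an average weight of 100 while persons need 200, so the person total is halved to bring the two levels into agreement.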
Check geo fields
Description
Helper function for check_tables. Makes sure that geographies in a seed and target table line up properly.
Usage
check_geo_fields(seed, target, target_name)
Arguments
seed |
seed table to check |
target |
data.frame of a single target table |
target_name |
the name of the target (e.g. size) |
Value
The seed and target table (which may be modified)
Check for missing categories in seed
Description
Helper function for check_tables.
Usage
check_missing_categories(seed, target, target_name, geo_colname)
Arguments
seed |
seed table to check |
target |
data.frame of a single target table |
target_name |
the name of the target (e.g. size) |
geo_colname |
the name of the geo column in both the |
Value
Nothing. Throws an error if one is found.
Check seed and target tables for completeness
Description
Given seed and targets, checks to make sure that at least one observation of each marginal category exists in the seed table. Otherwise, ipf/ipu would produce wrong answers without throwing errors.
Usage
check_tables(
primary_seed,
primary_targets,
secondary_seed = NULL,
secondary_targets = NULL,
primary_id
)
Arguments
primary_seed |
In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair. |
primary_targets |
A |
secondary_seed |
Most commonly, if the primary_seed describes
households, the secondary seed table would describe the persons in each
household. Must contain the same |
secondary_targets |
Same format as |
primary_id |
The field used to join the primary and secondary seed
tables. Only necessary if |
Value
both seed tables and target lists
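The completeness check in the Description can be approximated in a few lines of base R. The sketch below assumes the target table layout used in the package examples (a geo_cluster column plus one column per category); the deliberately unmatched category 3 and the variable names are hypothetical.

```r
# Seed table with household sizes 1 and 2 only.
seed <- data.frame(
  id = 1:4,
  siz = c(1, 2, 2, 1),
  geo_cluster = c(1, 1, 2, 2)
)

# Target table that also asks for a size-3 category (hypothetical).
target <- data.frame(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150),
  `3` = c(10, 20),
  check.names = FALSE
)

# Categories named in the target that never occur in the seed.
target_categories <- setdiff(names(target), "geo_cluster")
missing <- setdiff(target_categories, as.character(unique(seed$siz)))
```

No weight can be scaled toward a category with zero seed records, which is why such a mismatch is better raised as an error up front.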
Compare results to targets
Description
Compare results to targets
Usage
compare_results(seed, targets)
Arguments
seed |
|
targets |
|
Value
A data frame comparing balanced results to targets.
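As a rough illustration of such a comparison, the sketch below sums weights by category and reports the difference from each target. The data, the column names of the output, and the single-geography layout are all assumptions, not the package's actual output format.

```r
# Hypothetical weighted seed table and targets for one marginal (siz).
seed <- data.frame(siz = c(1, 2, 2, 1), weight = c(70, 30, 25, 5))
target <- c(`1` = 75, `2` = 50)

# Weighted total per category.
result <- tapply(seed$weight, seed$siz, sum)

comparison <- data.frame(
  category = names(target),
  target = unname(target),
  result = as.numeric(result[names(target)])
)
comparison$diff <- comparison$result - comparison$target
comparison$pct_diff <- 100 * comparison$diff / comparison$target
```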
Create a named list of target priority levels.
Description
Create a named list of target priority levels.
Usage
create_target_priority(target_priority, targets)
Arguments
target_priority |
This argument controls how quickly each set of
targets is relaxed. In other words: how important it is to match the target
exactly. Defaults to
|
targets |
The complete list of targets (both primary and secondary) |
Re-weight a Seed Table to Marginal Controls
Description
Re-weight a Seed Table to Marginal Controls
Usage
ipf(
seed,
targets,
relative_gap = 0.01,
absolute_gap = 1,
max_iterations = 50,
min_weight = 1e-04,
verbose = FALSE
)
Arguments
seed |
A |
targets |
A |
relative_gap |
target for convergence. Maximum percent change to allow any seed weight to move by while considering the process converged. By default, if no weights change by more than 1%, the process has converged. The process is said to be converged if either |
absolute_gap |
target for convergence. Maximum absolute change to allow any seed weight to move by while considering the process converged. By default, if no weights change by more than 1, the process has converged. The process is said to be converged if either |
max_iterations |
maximum number of iterations to perform, even if convergence is not reached. |
min_weight |
Minimum weight to allow in any cell to prevent zero weights. Set to .0001 by default. Should be arbitrarily small compared to your seed table weights. |
verbose |
Print details on the maximum expansion factor with each
iteration? Default |
Value
The seed data frame with a column of weights appended for each row in the target data.frames.
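The stopping rule described by relative_gap and absolute_gap can be sketched as a small helper. This assumes that meeting either criterion is sufficient and that min_weight is applied as a simple floor; the function name has_converged is hypothetical and the package's internal check may differ.

```r
# Hypothetical convergence check, not taken from the package source.
has_converged <- function(old_weights, new_weights,
                          relative_gap = 0.01, absolute_gap = 1) {
  abs_change <- abs(new_weights - old_weights)
  rel_change <- abs_change / old_weights
  # Assumption: either criterion is enough to stop iterating.
  max(rel_change) <= relative_gap || max(abs_change) <= absolute_gap
}

small_relative <- has_converged(c(1000, 2000), c(1005, 2004))
large_change <- has_converged(c(1000, 2000), c(1100, 2000))
small_absolute <- has_converged(c(10, 20), c(10.5, 20))

# A floor keeps weights strictly positive, so the division above is safe.
floored <- pmax(c(0, 2.5), 1e-04)
```

The third call shows why an absolute criterion is useful: a 5 percent change on a weight of 10 is only 0.5 units.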
Iterative Proportional Updating
Description
A general case of iterative proportional fitting. It can satisfy two disparate sets of marginals that do not agree on a single total. A common example is balancing population data using household- and person-level marginal controls. This could be for survey expansion or synthetic population creation. The second set of marginal/seed data is optional, meaning it can also be used for more basic IPF tasks.
Usage
ipu(
primary_seed,
primary_targets,
secondary_seed = NULL,
secondary_targets = NULL,
primary_id = "id",
secondary_importance = 1,
relative_gap = 0.01,
max_iterations = 100,
absolute_diff = 10,
weight_floor = 1e-05,
verbose = FALSE,
max_ratio = 10000,
min_ratio = 1e-04
)
Arguments
primary_seed |
In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair. |
primary_targets |
A |
secondary_seed |
Most commonly, if the primary_seed describes
households, the secondary seed table would describe the persons in each
household. Must contain the same |
secondary_targets |
Same format as |
primary_id |
The field used to join the primary and secondary seed
tables. Only necessary if |
secondary_importance |
A |
relative_gap |
After each iteration, the weights are compared to the
previous weights and the
the |
max_iterations |
maximum number of iterations to perform, even if
|
absolute_diff |
Upon completion, the ... For example, if a target value was 2 and the expanded weights equaled 1, that's a 100% difference, but the absolute difference is only 1. Defaults to 10. |
weight_floor |
Minimum weight to allow in any cell to prevent zero weights. Set to 1e-05 by default. Should be arbitrarily small compared to your seed table weights. |
verbose |
Print iteration details and worst marginal stats upon
completion? Default |
max_ratio |
|
min_ratio |
|
Value
A named list with the primary_seed with weight, a histogram of the weight distribution, and two comparison tables to aid in reporting.
References
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf
Examples
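# Household seed table: one row per household
hh_seed <- dplyr::tibble(
  id = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)
# Targets: one table per marginal, one row per geography,
# one column per category of that marginal
hh_targets <- list()
hh_targets$siz <- dplyr::tibble(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)
result <- ipu(hh_seed, hh_targets, max_iterations = 5)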
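To show what "satisfying household- and person-level controls at once" means mechanically, here is a simplified base R sketch of a proportional-updating loop. It is not the package's implementation: the incidence table, the targets, and the fixed iteration count are made up, and it assumes every household contains at most one person of each type (0/1 incidence), which reduces the update to repeated proportional scaling.

```r
# Hypothetical incidence table: one row per household, one column per
# constraint (two household types, two person types).
incidence <- matrix(
  c(1, 0, 1, 0,
    1, 0, 1, 1,
    0, 1, 0, 1,
    0, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("hh", 1:4),
                  c("hh_type1", "hh_type2", "per_A", "per_B"))
)
targets <- c(hh_type1 = 50, hh_type2 = 50, per_A = 80, per_B = 60)
weights <- rep(1, nrow(incidence))

for (iteration in 1:2000) {
  for (constraint in colnames(incidence)) {
    # Households that contribute to this constraint
    members <- incidence[, constraint] > 0
    current <- sum(weights * incidence[, constraint])
    # Scale their weights so the weighted sum matches the target
    weights[members] <- weights[members] * targets[[constraint]] / current
  }
}

# Weighted totals per constraint after the final pass
fitted <- colSums(weights * incidence)
```

Each household carries a single weight, so adjusting it for a person-level constraint also moves the household-level totals. Cycling through all constraints repeatedly is what lets both levels settle.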
Balance a matrix given row and column targets
Description
This function simplifies the call to 'ipu()' for the simple case of a matrix and row/column targets.
Usage
ipu_matrix(mtx, row_targets, column_targets, ...)
Arguments
mtx |
a |
row_targets |
a vector of targets that the row sums must match |
column_targets |
a vector of targets that the column sums must match |
... |
additional arguments that are passed to 'ipu()'. See
|
Value
A matrix that matches row and column targets.
Examples
mtx <- matrix(data = runif(9), nrow = 3, ncol = 3)
row_targets <- c(3, 4, 5)
column_targets <- c(5, 4, 3)
ipu_matrix(mtx, row_targets, column_targets)
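For intuition, the two-dimensional case can be written directly in base R as alternating row and column scaling. This is a generic matrix-balancing sketch with a fixed, hypothetical seed matrix and iteration count, not a description of how ipu_matrix works internally.

```r
# Fixed seed matrix so the result is reproducible (hypothetical values).
mtx <- matrix(1:9, nrow = 3, ncol = 3)
row_targets <- c(3, 4, 5)
column_targets <- c(5, 4, 3)

balanced <- mtx
for (iteration in 1:100) {
  # Scale each row so that row sums match the row targets
  balanced <- balanced * (row_targets / rowSums(balanced))
  # Scale each column so that column sums match the column targets
  balanced <- t(t(balanced) * (column_targets / colSums(balanced)))
}
```

Note that the row and column targets share the same grand total (12), which is what allows both sets to be matched simultaneously.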
Iterative Proportional Updating (Newton-Raphson)
Description
List balancing similar to ipu, but using the Newton-Raphson approach to optimization. Created primarily as a point of comparison for ipu.
Usage
ipu_nr(
primary_seed,
primary_targets,
secondary_seed = NULL,
secondary_targets = NULL,
target_priority = 1e+07,
relative_gap = 0.01,
max_iterations = 100,
absolute_diff = 10,
weight_floor = 1e-05,
verbose = FALSE,
max_ratio = 10000,
min_ratio = 1e-04
)
Arguments
primary_seed |
In population synthesis or household survey expansion,
this would be the household seed table (each record would represent a
household). It could also be a trip table, where each row represents an
origin-destination pair. Must contain a |
primary_targets |
A |
secondary_seed |
Most commonly, if the primary_seed describes households, the
secondary seed table would describe a unique person with each row. Must
also contain the |
secondary_targets |
Same format as |
target_priority |
This argument controls how quickly each set of
targets is relaxed. In other words: how important it is to match the target
exactly. Defaults to
|
relative_gap |
After each iteration, the weights are compared to the
previous weights and the
the |
max_iterations |
maximum number of iterations to perform, even if
|
absolute_diff |
Upon completion, the ... For example, if a target value was 2 and the expanded weights equaled 1, that's a 100% difference, but the absolute difference is only 1. Defaults to 10. |
weight_floor |
Minimum weight to allow in any cell to prevent zero weights. Set to 1e-05 by default. Should be arbitrarily small compared to your seed table weights. |
verbose |
Print iteration details and worst marginal stats upon
completion? Default |
max_ratio |
|
min_ratio |
|
Value
A named list with the primary_seed with weight, a histogram of the weight distribution, and two comparison tables to aid in reporting.
Helper function to process a seed table
Description
Helper for ipu(). Strips columns from seed table except for the primary id and marginal column (as reflected in the targets tables). Also identifies factor columns with one level and processes them before mlr::createDummyFeatures() is called.
Usage
process_seed_table(df, primary_id, marginal_columns)
Arguments
df |
the |
primary_id |
the name of the primary ID column. |
marginal_columns |
The vector of column names in the seed table that have matching targets. |
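The column stripping and dummy coding described above can be approximated without mlr. The sketch below uses plain base R and a hypothetical seed table; it illustrates the idea of one 0/1 indicator column per category and does not reproduce the helper's actual output or naming.

```r
# Hypothetical seed table with one column (income) that has no target.
seed <- data.frame(
  id = 1:4,
  siz = c(1, 2, 2, 1),
  veh = c(0, 1, 1, 2),
  income = c(10, 20, 30, 40)
)
primary_id <- "id"
marginal_columns <- c("siz", "veh")

# Keep only the id and the columns that have matching targets.
kept <- seed[, c(primary_id, marginal_columns)]

# Build one 0/1 indicator column per category of each marginal.
indicators <- kept[primary_id]
for (col in marginal_columns) {
  for (level in sort(unique(kept[[col]]))) {
    indicators[[paste(col, level, sep = ".")]] <-
      as.integer(kept[[col]] == level)
  }
}
```

Because the loop iterates over observed values, a column with a single category still yields one indicator column, which is the case the helper treats specially.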
Scale targets to ensure consistency
Description
Often, different marginals may disagree on the total number of units. In the context of household survey expansion, for example, one marginal might say there are 100k households while another says there are 101k. This function solves the problem by scaling all target tables to match the first target table provided.
Usage
scale_targets(targets, verbose = FALSE)
Arguments
targets |
|
verbose |
|
Value
A named list with the scaled targets.
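A minimal sketch of this scaling, assuming target tables with a geo_cluster column, rows in the same geography order, and scaling done separately per geography (all assumptions; the package may handle these details differently):

```r
# Hypothetical targets whose totals disagree (100/250 vs. 110/275).
targets <- list(
  siz = data.frame(geo_cluster = c(1, 2),
                   `1` = c(75, 100), `2` = c(25, 150),
                   check.names = FALSE),
  veh = data.frame(geo_cluster = c(1, 2),
                   `0` = c(30, 50), `1` = c(80, 225),
                   check.names = FALSE)
)

category_cols <- function(tbl) setdiff(names(tbl), "geo_cluster")

# Totals per geography implied by the first target table.
reference_totals <- rowSums(targets[[1]][category_cols(targets[[1]])])

scaled <- lapply(targets, function(tbl) {
  cols <- category_cols(tbl)
  # One scaling factor per geography (row)
  scale_factor <- reference_totals / rowSums(tbl[cols])
  tbl[cols] <- tbl[cols] * scale_factor
  tbl
})
```

After scaling, the vehicle targets sum to 100 and 250 per geography, matching the size targets, while the shares within each table are preserved.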
Create the ASU example
Description
Sets up the Arizona example IPU problem and is used in multiple places throughout the package (vignettes/tests).
Usage
setup_arizona()
Value
A list of four variables: hh_seed, hh_targets, per_seed, and per_targets. These can be used directly by ipu.
Examples
setup_arizona()
Creates a synthetic population based on ipu results
Description
A simple function that takes the weight_tbl output from ipu and randomly samples based on the weight.
Usage
synthesize(weight_tbl, group_by = NULL, primary_id = "id")
Arguments
weight_tbl |
the |
group_by |
if provided, the |
primary_id |
The field used to join the primary and secondary seed
tables. Only necessary if |
Value
A data.frame with one record for each synthesized member of the population (e.g. household). A new_id column is created, but the previous primary_id column is maintained to facilitate joining back to other data sources (e.g. a person attribute table).
Examples
# Household seed table: one row per household
hh_seed <- dplyr::tibble(
  id = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)
# Targets: one table per marginal, one row per geography
hh_targets <- list()
hh_targets$siz <- dplyr::tibble(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)
result <- ipu(hh_seed, hh_targets, max_iterations = 5)
# Sample households separately within each geo_cluster
synthesize(result$weight_tbl, "geo_cluster")
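The weighted sampling idea can be sketched in base R as follows. The weight table is hypothetical, and drawing round(sum(weight)) records with replacement is an assumption about how the population size is chosen, not a statement of what synthesize does internally.

```r
set.seed(1)  # for a reproducible draw

# Hypothetical weight table, such as one produced by a balancing step.
weight_tbl <- data.frame(id = 1:4, weight = c(75, 25, 150, 100))

# Assumption: the synthetic population size is the rounded weight total.
n <- round(sum(weight_tbl$weight))

# Draw seed records with probability proportional to their weight.
drawn <- sample(weight_tbl$id, size = n, replace = TRUE,
                prob = weight_tbl$weight)

# New sequential id, with the original id kept for joining back.
synthetic <- data.frame(new_id = seq_len(n), id = drawn)
```

Keeping the original id alongside new_id is what allows person-level attributes to be joined to each synthesized household afterwards.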