Type: Package
Title: Split-Apply-Combine with Dynamic Groups
Version: 1.0.0
Maintainer: Mark van der Loo <mark.vanderloo@gmail.com>
Description: Estimate group aggregates, where one can set user-defined conditions that each group of records must satisfy to be suitable for aggregation. If a group of records is not suitable, it is expanded using a collapsing scheme defined by the user. A paper on this package was published in the Journal of Statistical Software <doi:10.18637/jss.v112.i04>.
License: EUPL version 1.1 | EUPL version 1.2 [expanded from: EUPL]
URL: https://github.com/markvanderloo/accumulate
LazyData: TRUE
VignetteBuilder: simplermarkdown
Depends: R (≥ 3.5.0)
Suggests: tinytest, simplermarkdown, validate
Encoding: UTF-8
RoxygenNote: 7.3.2
NeedsCompilation: no
Packaged: 2025-03-18 16:34:26 UTC; mark
Author: Mark van der Loo [aut, cre]
Repository: CRAN
Date/Publication: 2025-03-18 21:10:02 UTC

Split-Apply-Combine with Collapsing Groups

Description

Compute grouped aggregates. If a group does not satisfy certain user-defined conditions (such as too many missing values, or not enough records) then the group is expanded according to a user-defined 'collapsing' scheme. This happens recursively until either the group satisfies all conditions and the aggregate is computed, or we run out of collapsing possibilities and NA is returned for that group.

Usage

accumulate(data, collapse, test, fun, ...)

cumulate(data, collapse, test, ...)

Arguments

data

[data.frame] The data to aggregate by (collapsing) groups.

collapse

[formula|data.frame] representing a group collapsing sequence. See below for details on how to specify each option.

test

[function] A function that takes a subset of data and returns TRUE if it is suitable for computing the desired aggregates and FALSE if a collapsing step is necessary.

fun

[function] A scalar function that will be applied to all columns of data.

...

For accumulate, extra arguments to be passed to fun. For cumulate, a comma-separated list of name=expression, where expression defines the aggregating operation.

Value

A data frame where each row represents a (multivariate) group. The first columns contain the grouping variables. The next column is called level and indicates to what level collapsing was necessary to compute a value, where 0 means that no collapsing was necessary. The following columns contain the aggregates defined in the ... argument. If no amount of collapsing yields a data set that is satisfactory according to test, then for that row, the level and subsequent columns are NA.

Using a formula to define the collapsing sequence

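If all combinations of collapsing options are stored as columns in data, the formula interface can be used. An example is the easiest way to see how it works. Suppose that collapse = A*B ~ A1*B + B. This means: first compute output for the groups defined by the variables A and B. If the records for a combination (a, b) do not pass the test, use the records of the larger group defined by A1 and B that contains it. If that group does not pass the test either, use all records with B == b. If that does not work, NA is returned for that combination (a, b).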

Generally, the formula must be of the form X0 ~ X1 + X2 + ... + Xn where each Xi is a (product of) grouping variable(s) in the data set.
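To make the reading of such a formula concrete, the following base R sketch splits a collapsing formula into the grouping variables used at each level. It only illustrates how the formula can be read; it is not the package's own parser, and the helper name split_plus is hypothetical.

```r
# The collapsing formula from the text above.
collapse <- A*B ~ A1*B + B

# Hypothetical helper: split an expression on top-level `+` into its terms.
split_plus <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(split_plus(e[[2]]), split_plus(e[[3]]))
  } else {
    list(e)
  }
}

# Level 0 is the left-hand side; each right-hand side term is one further level.
terms_by_level <- c(list(collapse[[2]]), split_plus(collapse[[3]]))
groupings <- lapply(terms_by_level, all.vars)
names(groupings) <- paste0("level", seq_along(groupings) - 1)

groupings
# $level0: "A" "B"
# $level1: "A1" "B"
# $level2: "B"
```

Each element lists the columns of data that define the groups tried at that collapsing level.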

Using a data frame to define the collapsing scheme

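In this case collapse is a data frame with columns [A0, A1, ..., An]. The variable A0 represents the most fine-grained grouping and must also be present in data. Aggregation works as follows: first compute aggregates by A0. If the records of a group a0 do not pass the test, use the larger set of records whose A0 labels map to the same value of A1 to compute the value for a0. Repeat with A2, ..., An until the test is passed or no more collapsing is possible; in the latter case NA is returned for a0.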
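The collapsing loop for a data frame scheme can be pictured with a small base R sketch. This is an illustration under simplifying assumptions (a single aggregate, a hypothetical helper collapse_one), not the implementation used by accumulate or cumulate.

```r
dat <- data.frame(A0 = c("11", "12", "11", "22"), Y = c(2, 4, 6, 8),
                  stringsAsFactors = FALSE)
csh <- data.frame(A0 = c("11", "12", "22"), A1 = c("1", "1", "2"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: aggregate one target group `a0`, widening the group
# along the columns of `scheme` until `test` passes.
collapse_one <- function(a0, data, scheme, test, fun) {
  key <- names(scheme)[1]
  row <- scheme[scheme[[1]] == a0, , drop = FALSE]
  for (level in seq_len(ncol(scheme)) - 1L) {
    # All fine-grained labels sharing a0's label at this collapsing level.
    members <- scheme[[1]][scheme[[level + 1L]] == row[[level + 1L]]]
    d <- data[data[[key]] %in% members, , drop = FALSE]
    if (isTRUE(test(d))) return(list(level = level, value = fun(d)))
  }
  # No collapsing level yields a suitable set of records.
  list(level = NA_integer_, value = NA_real_)
}

res <- lapply(csh$A0, collapse_one, data = dat, scheme = csh,
              test = function(d) nrow(d) >= 2,
              fun  = function(d) mean(d$Y))

out <- data.frame(A0    = csh$A0,
                  level = sapply(res, `[[`, "level"),
                  mn    = sapply(res, `[[`, "value"))
out
# "11": level 0, mn 4; "12": level 1, mn 4; "22": level NA, mn NA
```

Group "12" has a single record, so it borrows the records of "11" through their shared parent "1". Group "22" has no siblings under "2", so no collapsing level helps.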

References

MPJ van der Loo (2025). Split-Apply-Combine with Dynamic Grouping. Journal of Statistical Software. doi:10.18637/jss.v112.i04.

Examples


## Example of data frame defining collapsing scheme, using accumulate

input    <- data.frame(Y1 = 2^(0:8), Y2 = 2^(0:8))
input$Y2[c(1,4,7)] <- NA
# make sure that the input data also has the most fine-grained (target)
# grouping variable
input$A0 <- c(123,123,123,135,136,137,212,213,225)

# define collapsing sequence
collapse <- data.frame(
     A0   = c(123, 135, 136, 137, 212, 213, 225)
   , A1   = c(12 , 13 , 13 , 13 , 21 , 21 , 22 )
   , A2   = c(1  , 1  , 1  , 1  , 2  , 2  , 2  )
)

accumulate(input
 , collapse
 , test = function(d) nrow(d)>=3
 , fun  = sum, na.rm=TRUE)


## Example of formula defining collapsing scheme, using cumulate
input <- data.frame(
   A  = c(1,1,1,2,2,2,3,3,3)
 , B  = c(11,11,11,12,12,13,21,22,12)
 , B1 = c(1,1,1,1,1,1,2,2,1)
 , Y  = 2^(0:8)
)
cumulate(input, collapse=A*B ~ A*B1 + A
        , test = function(d) nrow(d) >= 3
        , tY = sum(Y))


## Example with formula defining collapsing scheme, using accumulate
# The collapsing scheme must be represented by variables in the 
# data. All columns not part of the collapsing scheme will be aggregated
# over.

input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y1 = 2^(0:8)
  , Y2 = 2^(0:8)
)

input$Y2[c(1,4,7)] <- NA

accumulate(input
 , collapse = A*B ~ A*B1 + A
 , test=function(a) nrow(a)>=3
 , fun = sum, na.rm=TRUE)



## Example with data.frame defining collapsing scheme, using cumulate
dat <- data.frame(A0 = c("11","12","11","22"), Y = c(2,4,6,8))
# collapsing scheme
csh <- data.frame(
   A0 = c("11","12","22")
 , A1 = c("1" ,"1", "2") 
)
cumulate(data = dat
   , collapse = csh
   , test     = function(d) if (nrow(d)<2) FALSE else TRUE
   , mn = mean(Y, na.rm=TRUE)
   , md = median(Y, na.rm=TRUE)
)


Derive collapsing scheme from a hierarchical classification

Description

Derive a collapsing scheme where group labels collapse to their parents in the hierarchy.

Usage

csh_from_digits(x, levels = max(nchar(x)) - 1)

Arguments

x

[character|integer] labels in a hierarchical classification (lowest level)

levels

[integer >=0] how many collapsing levels to include. Zero means only include the original labels.

Value

A data frame where each consecutive pair of columns represents one collapsing step induced by the hierarchical classification encoded by the digits in x.
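The idea of collapsing labels to their parents can be sketched in base R by dropping trailing digits. The sketch assumes a balanced classification (all labels have the same number of digits) and uses column names A0, A1, ... purely for illustration; it does not reproduce how csh_from_digits treats unbalanced classifications or names its columns.

```r
x <- c("111", "112", "121", "122", "123")
n <- max(nchar(x))

# Column k keeps the first n - k digits, so each column is the parent
# level of the column before it.
cols <- lapply(0:(n - 1), function(k) substr(x, 1, n - k))
names(cols) <- paste0("A", 0:(n - 1))
csh <- data.frame(cols, stringsAsFactors = FALSE)

csh
# A0: 111 112 121 122 123
# A1: 11  11  12  12  12
# A2: 1   1   1   1   1
```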

Examples

# balanced hierarchical classification
csh_from_digits(c("111","112","121","122","123"))
csh_from_digits(c("111","112","121","122","123"),levels=1)

# unbalanced hierarchical classification
csh_from_digits(c("111","112","121","122","1221","1222"))
csh_from_digits(c("111","112","121","122","1221","1222"),levels=2)


Demand minimal fraction of complete records

Description

Demand minimal fraction of complete records

Usage

frac_complete(r, vars = TRUE)

Arguments

r

Minimal fraction of records that must be complete.

vars

[TRUE|column index] Column index into the data to be tested (e.g. a character vector with variable names or a numeric vector with column positions). The indexed columns will be tested for completeness (absence of NA). By default vars=TRUE, meaning that all columns are taken into account.

Value

a function that accepts a data frame and returns TRUE when the fraction of complete records is larger than or equal to r and otherwise FALSE.
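A test function of this kind can be written as a closure in base R. The sketch below follows the behaviour documented above; the name my_frac_complete is hypothetical and the body is an assumption, not a copy of the package source.

```r
# Hypothetical stand-in for a "fraction complete" test generator.
my_frac_complete <- function(r, vars = TRUE) {
  stopifnot(is.numeric(r), r >= 0, r <= 1)
  # d[vars] selects the columns to check; vars = TRUE keeps all of them.
  function(d) mean(complete.cases(d[vars])) >= r
}

f <- my_frac_complete(0.9)
f(mtcars)                 # TRUE: all 32 records are complete

mt <- mtcars
mt[1:5, 1] <- NA
f(mt)                     # FALSE: 27/32 = 0.84 is below 0.9

# Restricting the check to columns without missing values passes again.
g <- my_frac_complete(0.9, vars = c("cyl", "hp"))
g(mt)                     # TRUE
```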

See Also

Other helpers: min_complete(), min_records()

Examples


f <- frac_complete(0.9)
f(mtcars) # TRUE (all 32 records complete)
mt <- mtcars
mt[1:5,1] <- NA
f(mt)     # FALSE (only 27/32 records complete)


Use a validate::validator object to define a test

Description

Create a test function that accepts a data.frame, and returns TRUE when the data passes all checks defined in the validator object, and otherwise FALSE.

Usage

from_validator(v, ...)

Arguments

v

[validator] a validator object from the validate package.

...

options passed to validate::confront

Value

a function that accepts a data frame and returns TRUE when the data passes all checks in v and otherwise FALSE.

Note

Requires the validate package to be installed.

References

Mark P. J. van der Loo, Edwin de Jonge (2021). Data Validation Infrastructure for R. Journal of Statistical Software, 97(10), 1-31. doi:10.18637/jss.v097.i10

Examples


if (requireNamespace("validate", quietly=TRUE)){
 v <- validate::validator(height >= 0, weight >= 0)
 f <- from_validator(v)
 f(women)  # TRUE (all heights and weights are nonnegative)
}



Demand minimal number of complete records

Description

Demand minimal number of complete records

Usage

min_complete(n, vars = TRUE)

Arguments

n

Minimal number of records that must be complete

vars

[TRUE|column index] Column index into the data to be tested (e.g. a character vector with variable names or a numeric vector with column positions). The indexed columns will be tested for completeness (absence of NA). By default vars=TRUE, meaning that all columns are taken into account.

Value

a function that accepts a data frame and returns TRUE when the number of complete records is larger than or equal to n and otherwise FALSE.

See Also

Other helpers: frac_complete(), min_records()

Examples


f <- min_complete(20)
f(women)  # FALSE (15 records)
f(mtcars) # TRUE (32 records)


Demand minimal number of records

Description

Demand minimal number of records

Usage

min_records(n)

Arguments

n

Minimal number of records in a group.

Value

a function that accepts a data frame and returns TRUE when the number of records is larger than or equal to n and otherwise FALSE.
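When a group must satisfy more than one condition, the conditions can be combined in a single hand-written test function. The following base R sketch demands both a minimal number of records and a minimal fraction of complete records; make_test and its arguments are hypothetical names used for illustration only.

```r
# Hypothetical test generator: at least `n` records AND at least a
# fraction `r` of complete records in the columns `vars`.
make_test <- function(n, r, vars = TRUE) {
  function(d) {
    # `&&` short-circuits, so for n >= 1 the fraction is never computed
    # on an empty subset (where mean() would return NaN).
    nrow(d) >= n && mean(complete.cases(d[vars])) >= r
  }
}

test <- make_test(n = 10, r = 0.8)
test(women)               # TRUE: 15 records, all complete
test(women[1:5, ])        # FALSE: fewer than 10 records
test(women[0, ])          # FALSE: empty subset

mt <- mtcars
mt[1:10, "mpg"] <- NA
test(mt)                  # FALSE: 22/32 = 0.69 complete
```

The result is always a single TRUE or FALSE, which is what accumulate and cumulate expect from test.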

See Also

Other helpers: frac_complete(), min_complete()

Examples


min_records(5)(women)
min_records(200)(women)


Create a classed list

Description

Classed lists are used to pretty-print a list that is stored in a data frame.

Usage

object_list(x)

## S3 method for class 'object_list'
format(x, ...)

## S3 method for class 'object_list'
print(x, ...)

## S3 method for class 'object_list'
x[i, j, ..., drop = TRUE]

Arguments

x

a list

Examples

object_list(list(lm(speed ~ dist, data=cars)))


Synthetic data on producers

Description

A synthetic dataset listing several sources of turnover and other income for producers. The producers are classified in size classes and SBI (a refinement of NACE). Load with data(producers).

Format

A .rda file, one producer per row.


Check your testing function against common edge cases

Description

Writing a testing function that works on any subset of records of a data frame can be quite subtle. This function tries the testing function on a number of common (edge) cases that are easily overlooked. It is not a unit test: a smoke test will not tell you whether your output is correct. It only checks the output data type (must be TRUE or FALSE) and reports if errors, warnings, or messages occur.
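The kind of check involved can be sketched in base R. The edge cases below (zero rows, one row, all values missing) are an assumed selection for illustration and the helper is_valid_result is hypothetical; the set of cases that smoke_test actually tries may differ.

```r
dat <- data.frame(x = 1:5, y = (-2):2)

# A result is acceptable only if it is a single, non-missing logical.
is_valid_result <- function(test, d) {
  out <- tryCatch(test(d), error = function(e) e)
  is.logical(out) && length(out) == 1L && !is.na(out)
}

all_na <- dat
all_na[] <- NA            # same shape, every value missing

edge_cases <- list(
  full      = dat,
  zero_rows = dat[0, , drop = FALSE],
  one_row   = dat[1, , drop = FALSE],
  all_na    = all_na
)

fragile <- function(d) d$y > 0                        # vector output
robust  <- function(d) sum(d$y > 0, na.rm = TRUE) > 2 # single TRUE/FALSE

fragile_ok <- vapply(edge_cases, is_valid_result, logical(1), test = fragile)
robust_ok  <- vapply(edge_cases, is_valid_result, logical(1), test = robust)

fragile_ok  # FALSE FALSE TRUE FALSE
robust_ok   # TRUE  TRUE  TRUE TRUE
```

The fragile function only yields a valid result on the one-row subset, which is why a test that looks fine on a single case can still fail during collapsing.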

Usage

smoke_test(dat, test, verbose = FALSE, halt = TRUE)

Arguments

dat

an example dataset. For example the full dataset to be fed into accumulate or cumulate.

test

A testing function to be passed as argument to accumulate or cumulate.

verbose

[logical] If TRUE, all results (including passed tests) are printed. If FALSE only failed tests are printed.

halt

[logical] toggle stopping when an error is thrown

Value

NULL, invisibly. This function has as side-effect that test results are printed to screen.

Examples

dat <- data.frame(x = 1:5, y=(-2):2)
smoke_test(dat, function(d) y > 0)   # error: y not found
smoke_test(dat, function(d) d$y > 0) # issue: output too long, not robust against NA
smoke_test(dat, function(d) sum(d$y > 0) > 2) # issue: not robust against NA
smoke_test(dat, function(d) sum(d$y > 0, na.rm=TRUE) > 2) # OK