Type: Package
Title: 'SplitWise': Hybrid Stepwise Regression with Single-Split Dummy Encoding
Version: 1.0.0
Description: Implements 'SplitWise', a hybrid regression approach that transforms numeric variables into either single-split (0/1) dummy variables or retains them as continuous predictors. The transformation is followed by stepwise selection to identify the most relevant variables. The default 'iterative' mode adaptively explores partial synergies among variables to enhance model performance, while an alternative 'univariate' mode applies simpler transformations independently to each predictor. For details, see Kurbucz et al. (2025) <doi:10.48550/arXiv.2505.15423>.
License: GPL (≥ 3)
Encoding: UTF-8
Depends: R (≥ 3.5.0)
Imports: rpart, stats
RoxygenNote: 7.3.2
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0)
Config/testthat/edition: 3
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2025-05-26 20:17:10 UTC; Marcell
Author: Marcell T. Kurbucz [aut, cre], Nikolaos Tzivanakis [aut], Nilufer Sari Aslam [aut], Adam Sykulski [aut]
Maintainer: Marcell T. Kurbucz <m.kurbucz@ucl.ac.uk>
Repository: CRAN
Date/Publication: 2025-05-28 16:00:02 UTC

Decide Variable Type (Iterative)

Description

A stepwise variable-selection method that iteratively chooses each variable's best form: "linear", single-split "dummy", or double-split ("middle=1") dummy, based on AIC/BIC improvement. Supports "forward", "backward", or "both" strategies.

Usage

decide_variable_type_iterative(
  X,
  Y,
  minsplit = 5,
  direction = c("backward", "forward", "both"),
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  ...
)

Arguments

X

A data frame of predictors (no response).

Y

A numeric vector (the response).

minsplit

Minimum number of observations in a node to consider splitting. Default = 5.

direction

Stepwise strategy: "forward", "backward", or "both". Default = "backward".

criterion

A character string: either "AIC" or "BIC". Default = "AIC".

exclude_vars

A character vector of variable names to exclude from dummy transformations. These variables will always be treated as linear. Default = NULL.

verbose

Logical; if TRUE, prints messages for debugging. Default = FALSE.

...

Additional arguments (currently unused).

Details

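Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the partial residuals of the current model. Up to two split points are extracted from that tree: one split gives a single-split dummy, and two splits give a double-split dummy that equals 1 between the two cutoffs (the "middle=1" form).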

The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. Variables listed in exclude_vars will be forced to remain linear (dummy transformations are never attempted).
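The selection step described above can be illustrated with a minimal sketch in base R and rpart. It is not the package's internal code: the choice of mtcars, of wt as the "current model", of hp as the candidate, the use of ordinary residuals as a stand-in for partial residuals, and the ">=" direction of the dummy are all assumptions made for illustration.

```r
library(rpart)

data(mtcars)

# Stand-in for the "current model": mpg explained by wt only (assumption).
current <- lm(mpg ~ wt, data = mtcars)
work <- data.frame(res = resid(current), hp = mtcars$hp)

# Shallow tree on the residuals, using only the candidate predictor hp.
# maxcompete/maxsurrogate = 0 keeps only primary splits in tree$splits.
tree <- rpart(
  res ~ hp,
  data    = work,
  control = rpart.control(maxdepth = 2, minsplit = 5,
                          maxcompete = 0, maxsurrogate = 0)
)

# Split points in stored order (root first); empty if the tree did not split.
raw_cuts <- if (is.null(tree$splits)) numeric(0) else unname(tree$splits[, "index"])
raw_cuts <- head(unique(raw_cuts), 2)

# Candidate forms of hp, each added to the current model.
df <- mtcars
candidates <- list(linear = lm(mpg ~ wt + hp, data = df))
if (length(raw_cuts) >= 1) {
  df$hp_single <- as.integer(df$hp >= raw_cuts[1])
  candidates$single <- lm(mpg ~ wt + hp_single, data = df)
}
if (length(raw_cuts) == 2) {
  rng <- sort(raw_cuts)
  df$hp_middle <- as.integer(df$hp > rng[1] & df$hp < rng[2])
  candidates$double <- lm(mpg ~ wt + hp_middle, data = df)
}

# Keep the form with the lowest AIC (swap in BIC for criterion = "BIC").
aic  <- sapply(candidates, AIC)
best <- names(which.min(aic))
```

Here `best` names the winning form and `aic` holds the score of every candidate, which makes it easy to see how close the alternatives were.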

Value

A named list of decisions, where each element is a list with:

type

Either "linear" or "dummy".

cutoff

A numeric vector of length 1 or 2 (the chosen split points).


Decide Variable Type (Univariate)

Description

For each numeric predictor, this function fits a shallow (maxdepth = 2) rpart tree directly on Y ~ x and tests whether a dummy transformation improves model fit.

Usage

decide_variable_type_univariate(
  X,
  Y,
  minsplit = 5,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE
)

Arguments

X

A data frame of numeric predictors (no response).

Y

A numeric response vector.

minsplit

Minimum number of observations in a node to consider splitting. Default = 5.

criterion

A character string: either "AIC" or "BIC". Default = "AIC".

exclude_vars

A character vector of variable names to exclude from dummy transformations. These variables will always be treated as linear. Default = NULL.

verbose

Logical; if TRUE, prints messages for debugging. Default = FALSE.

Details

Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the data. Up to two split points are extracted from that tree: one for a single-split dummy, two for a double-split dummy.

The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. If a variable is listed in exclude_vars, it will always be used as a linear predictor (dummy transformation is never attempted).

Value

A named list of decisions, where each element is a list with:

type

Either "dummy" or "linear".

cutoffs

A numeric vector (length 1 or 2) if type = "dummy", or NULL if linear.

tree_model

The fitted rpart model (for reference) or NULL if excluded.
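As a rough illustration of the decision logic and of the return structure documented above, the following sketch decides the form of each predictor independently. The helper name sketch_decide_one is hypothetical, and scoring each form with a one-predictor linear model is an assumption; the package may compare candidates differently.

```r
library(rpart)

# Hypothetical helper (not part of the package): decide the form of one predictor.
sketch_decide_one <- function(x, y, minsplit = 5, criterion = c("AIC", "BIC")) {
  criterion <- match.arg(criterion)
  score <- if (criterion == "AIC") stats::AIC else stats::BIC

  # Shallow tree fit directly on y ~ x; keep primary splits only.
  tree <- rpart(
    y ~ x,
    data    = data.frame(x = x, y = y),
    control = rpart.control(maxdepth = 2, minsplit = minsplit,
                            maxcompete = 0, maxsurrogate = 0)
  )
  cuts <- if (is.null(tree$splits)) numeric(0) else unname(tree$splits[, "index"])
  cuts <- head(unique(cuts), 2)

  # Candidate encodings of x (assumed forms: x >= c, or lower < x < upper).
  forms <- list(linear = list(z = x, cutoffs = NULL))
  if (length(cuts) >= 1) {
    forms$single <- list(z = as.integer(x >= cuts[1]), cutoffs = cuts[1])
  }
  if (length(cuts) == 2) {
    rng <- sort(cuts)
    forms$double <- list(z = as.integer(x > rng[1] & x < rng[2]), cutoffs = rng)
  }

  # Score every form and keep the one with the lowest criterion value.
  scores <- vapply(forms, function(f) score(lm(y ~ f$z)), numeric(1))
  best   <- names(which.min(scores))

  list(
    type       = if (best == "linear") "linear" else "dummy",
    cutoffs    = forms[[best]]$cutoffs,
    tree_model = tree
  )
}

# Apply the helper column by column; excluded variables stay linear.
X <- mtcars[, c("hp", "wt", "qsec")]
exclude_vars <- "qsec"
decisions <- lapply(names(X), function(v) {
  if (v %in% exclude_vars) {
    return(list(type = "linear", cutoffs = NULL, tree_model = NULL))
  }
  sketch_decide_one(X[[v]], mtcars$mpg)
})
names(decisions) <- names(X)
```

Each element of `decisions` then has the same three fields as the Value section describes, so `decisions$hp$type` and `decisions$hp$cutoffs` can be inspected directly.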


SplitWise Regression

Description

Transforms each numeric variable into a single-split dummy or keeps it linear, then runs stats::step() for stepwise selection. The user can choose between a simpler univariate transformation and an iterative approach.

Usage

splitwise(
  formula,
  data,
  transformation_mode = c("iterative", "univariate"),
  direction = c("backward", "forward", "both"),
  minsplit = 5,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  trace = 1,
  steps = 1000,
  k = 2,
  ...
)

## S3 method for class 'splitwise_lm'
print(x, ...)

## S3 method for class 'splitwise_lm'
summary(object, ...)

Arguments

formula

A formula specifying the response and (initial) predictors, e.g. mpg ~ . (the dot stands for all other columns in data).

data

A data frame containing the variables used in formula.

transformation_mode

Either "iterative" or "univariate". Default = "iterative".

direction

Stepwise direction: "backward", "forward", or "both". Default = "backward".

minsplit

Minimum number of observations in a node to consider splitting. Default = 5.

criterion

Either "AIC" or "BIC". Default = "AIC". Note: if you choose "BIC", you will typically also want k = log(nrow(data)) so that the stepwise stage uses the same penalty.

exclude_vars

A character vector naming variables that should be forced to remain linear (i.e., no dummy splits allowed). Default = NULL.

verbose

Logical; if TRUE, prints debug info in transformation steps. Default = FALSE.

trace

If positive, step() prints info at each step. Default = 1.

steps

Maximum number of steps for step(). Default = 1000.

k

Penalty multiple for the number of degrees of freedom (used by step()). E.g. 2 for AIC, log(n) for BIC. Default = 2.

...

Additional arguments passed to summary.lm.

x

A "splitwise_lm" object returned by splitwise.

object

A "splitwise_lm" object returned by splitwise.
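The relationship between the k argument and the two criteria can be checked in base R. The sketch below uses an arbitrary lm fit on mtcars (an assumption made for illustration) and shows that a penalty of k = log(n) turns the AIC-type score into BIC, and that extractAIC(), which step() uses to score models, accepts the same k.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)

# With k = log(n), the generic AIC() reproduces BIC().
same_as_bic <- isTRUE(all.equal(AIC(fit, k = log(n)), BIC(fit)))

# extractAIC() returns c(edf, score); only the penalty term k * edf changes with k.
score_aic <- extractAIC(fit, k = 2)
score_bic <- extractAIC(fit, k = log(n))
penalty_gap <- score_bic[2] - score_aic[2]   # equals edf * (log(n) - 2)
```

Because only the penalty term changes, leaving k = 2 while setting criterion = "BIC" would score the transformation stage and the stepwise stage with different penalties.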

Value

An S3 object of class c("splitwise_lm", "lm"), storing:

splitwise_info

List containing transformation decisions, final data, and call.

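Functions

print(splitwise_lm) and summary(splitwise_lm): S3 methods for "splitwise_lm" objects, with the signatures shown under Usage.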

Examples

# Load the mtcars dataset
data(mtcars)

# Univariate transformations (AIC-based, backward stepwise)
model_uni <- splitwise(
  mpg ~ .,
  data                = mtcars,
  transformation_mode = "univariate",
  direction           = "backward",
  trace               = 0
)
summary(model_uni)

# Iterative approach (BIC-based, forward stepwise)
# Note: typically set k = log(nrow(mtcars)) for BIC in step().
model_iter <- splitwise(
  mpg ~ .,
  data                = mtcars,
  transformation_mode = "iterative",
  direction           = "forward",
  criterion           = "BIC",
  k                   = log(nrow(mtcars)),
  trace               = 0
)
summary(model_iter)
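For intuition about the two stages behind the calls above, the following base R sketch encodes one variable as a 0/1 dummy by hand and then runs stats::step() on the transformed data. The cutoff of 120 for hp and the subset of columns are arbitrary assumptions chosen for illustration; in the package the cutoffs come from rpart trees rather than being fixed by the user.

```r
data(mtcars)

# Stage 1 (assumed, hand-picked): keep some predictors linear, dummy-encode hp.
df <- mtcars[, c("mpg", "wt", "qsec", "drat")]
df$hp_dummy <- as.integer(mtcars$hp >= 120)

# Stage 2: backward stepwise selection on the transformed data frame.
full     <- lm(mpg ~ ., data = df)
selected <- step(full, direction = "backward", trace = 0, k = 2)

# Terms that survived the selection.
kept <- attr(terms(selected), "term.labels")
```

The result `selected` is an ordinary lm object, so summary(selected) and predict(selected, newdata) behave as usual, provided newdata contains the same dummy column.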


Transform Features (Iterative Logic)

Description

Builds the final data frame from the decisions made by decide_variable_type_iterative, that is, which variables to include and in which form.

Usage

transform_features_iterative(X, decisions)

Arguments

X

Original predictor data frame.

decisions

Output of decide_variable_type_iterative.

Value

A data frame with the chosen variables in their final forms (dummy or linear).


Transform Features (Univariate Logic)

Description

Given the decisions (dummy or linear) for each predictor, produces a transformed data frame. Dummy columns are coded 0/1 according to the stored cutoff(s).

Usage

transform_features_univariate(X, decisions)

Arguments

X

Original predictor data frame.

decisions

The list returned by decide_variable_type_univariate.

Value

A new data frame with either the original column or a dummy column for each variable.
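A transformation of this kind can be sketched as follows. The helper sketch_transform is hypothetical, the field name cutoffs follows the Value section of decide_variable_type_univariate, and the dummy definitions (x >= c for one cutoff, lower < x < upper for two) as well as keeping the original column names are assumptions, not the package's documented behaviour.

```r
# Hypothetical helper (not part of the package): apply decisions to predictors.
sketch_transform <- function(X, decisions) {
  out <- lapply(names(decisions), function(v) {
    d <- decisions[[v]]
    x <- X[[v]]
    if (d$type == "linear") {
      return(x)                                   # keep the original column
    }
    cuts <- sort(d$cutoffs)
    if (length(cuts) == 1) {
      as.integer(x >= cuts)                       # single-split dummy
    } else {
      as.integer(x > cuts[1] & x < cuts[2])       # double-split, middle = 1
    }
  })
  names(out) <- names(decisions)
  as.data.frame(out)
}

# Toy input with hand-written decisions.
X <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(10, 20, 30, 40), x3 = c(5, 6, 7, 8))
decisions <- list(
  x1 = list(type = "dummy",  cutoffs = 2.5),
  x2 = list(type = "dummy",  cutoffs = c(15, 35)),
  x3 = list(type = "linear", cutoffs = NULL)
)
Z <- sketch_transform(X, decisions)
# Z$x1 is 0 0 1 1, Z$x2 is 0 1 1 0, Z$x3 is unchanged
```

Variables that are absent from `decisions` are dropped from the output, which matches the idea that only chosen variables appear in the final data frame.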