Help for package autoEnsemble

Type:

Package

Title:

Automated Stacked Ensemble Classifier for Severe Class Imbalance

Version:

0.3

Depends:

R (≥ 3.5.0),

Description:

A stacking solution for modeling imbalanced and severely skewed data. It automates the process of building homogeneous or heterogeneous stacked ensemble models by selecting "best" models according to different criteria. In doing so, it strategically searches for and selects diverse, high-performing base-learners to construct ensemble models optimized for skewed data. This package is particularly useful for addressing class imbalance in datasets, ensuring robust and effective model outcomes through advanced ensemble strategies which aim to stabilize the model, reduce its overfitting, and further improve its generalizability.

License:

MIT + file LICENSE

Encoding:

UTF-8

Imports:

h2o (≥ 3.34.0.0), h2otools (≥ 0.3), curl (≥ 4.3.0)

RoxygenNote:

7.3.2

URL:

https://github.com/haghish/autoEnsemble, https://www.sv.uio.no/psi/english/people/academic/haghish/

BugReports:

https://github.com/haghish/autoEnsemble/issues

NeedsCompilation:

Packaged:

2025-03-20 09:48:25 UTC; haghish

Author:

E. F. Haghish [aut, cre, cph]

Maintainer:

E. F. Haghish <haghish@hotmail.com>

Repository:

CRAN

Date/Publication:

2025-03-20 11:50:13 UTC

Automatically Trains H2O Models and Builds a Stacked Ensemble Model

Description

Automatically trains a range of algorithms to build base-learners, then creates a stacked ensemble model from them.

Usage

autoEnsemble(
  x,
  y,
  training_frame,
  validation_frame = NULL,
  nfolds = 10,
  balance_classes = TRUE,
  max_runtime_secs = NULL,
  max_runtime_secs_per_model = NULL,
  max_models = NULL,
  sort_metric = "AUCPR",
  include_algos = c("GLM", "DeepLearning", "DRF", "XGBoost", "GBM"),
  save_models = FALSE,
  directory = paste("autoEnsemble", format(Sys.time(), "%d-%m-%y-%H:%M")),
  ...,
  newdata = NULL,
  family = "binary",
  strategy = c("search"),
  model_selection_criteria = c("auc", "aucpr", "mcc", "f2"),
  min_improvement = 1e-05,
  max = NULL,
  top_rank = seq(0.01, 0.99, 0.01),
  stop_rounds = 3,
  reset_stop_rounds = TRUE,
  stop_metric = "auc",
  seed = -1,
  verbatim = FALSE,
  startH2O = FALSE,
  nthreads = NULL,
  max_mem_size = NULL,
  min_mem_size = NULL,
  ignore_config = FALSE,
  bind_to_localhost = FALSE,
  insecure = TRUE
)
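The search-related defaults in the signature above (`top_rank`, `min_improvement`, `stop_rounds`, `reset_stop_rounds`) can be read as a patience-style early-stopping rule applied while the search moves through increasing shares of top-ranked models. The following plain-R sketch illustrates that reading only; it is not the package's implementation. The helper `search_stop()` and the `scores` vector are hypothetical, and the rule autoEnsemble actually applies may differ in detail.

```r
# Hypothetical helper, NOT part of autoEnsemble: one possible reading of
# min_improvement, stop_rounds, and reset_stop_rounds as a patience counter.
search_stop <- function(scores,
                        min_improvement = 1e-05,
                        stop_rounds = 3,
                        reset_stop_rounds = TRUE) {
  best <- -Inf               # best stop_metric value seen so far
  best_round <- NA_integer_  # round at which the best value was reached
  penalty <- 0L              # rounds without a sufficient improvement
  rounds_run <- 0L
  for (i in seq_along(scores)) {
    rounds_run <- i
    if (scores[i] - best > min_improvement) {
      best <- scores[i]
      best_round <- i
      # Assumption: an improvement clears the accumulated penalty
      if (reset_stop_rounds) penalty <- 0L
    } else {
      penalty <- penalty + 1L
    }
    if (penalty >= stop_rounds) break
  }
  list(best_round = best_round, best_score = best, rounds_run = rounds_run)
}

# Default grid from the signature: 1% to 99% in 1% steps
top_rank <- seq(0.01, 0.99, 0.01)

# Made-up stop_metric values for the first nine search rounds
scores <- c(0.80, 0.82, 0.82, 0.81, 0.83, 0.83, 0.83, 0.83, 0.90)

res <- search_stop(scores)
res$best_round   # 5: the last round that improved by more than min_improvement
res$rounds_run   # 8: three non-improving rounds in a row end the search
```

In this sketch, setting `reset_stop_rounds = FALSE` lets the penalty accumulate across the whole search, so the same scores stop after round 6 instead of round 8. In both cases the search stops before reaching the late improvement at round 9, which is the usual trade-off of a small `stop_rounds` value.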

Arguments

x

Vector. Predictor column names or indices.

y

Character. The response column name or index.

training_frame

An H2OFrame containing the training data. This argument is required; the Usage signature defines no default.

validation_frame

An H2OFrame for early stopping. Default is NULL.

nfolds

Integer. Number of folds for cross-validation. Default is 10.

balance_classes

Logical. Whether to oversample the minority classes to balance the class distribution; only applicable to classification. Default is TRUE.

max_runtime_secs

Integer. Maximum time, in seconds, that the AutoML process is allowed to run. Default is NULL.

max_runtime_secs_per_model

Maximum runtime, in seconds, allowed for training each individual model. Default is NULL.

max_models

Maximum number of models to build during AutoML training (passed to autoML). Default is NULL.

sort_metric

Metric to sort the leaderboard by (passed to autoML). For binomial classification choose between "AUC", "AUCPR", "logloss", "mean_per_class_error", "RMSE", and "MSE". For regression choose between "mean_residual_deviance", "RMSE", "MSE", "MAE", and "RMSLE". For multinomial classification choose between "mean_per_class_error", "logloss", "RMSE", and "MSE". Default is "AUCPR". If set to "AUTO", then "AUC" will be used for binomial classification, "mean_per_class_error" for multinomial classification, and "mean_residual_deviance" for regression.

include_algos

Vector of character strings naming the algorithms to restrict to during the model-building phase. This argument is passed to autoML.

save_models

Logical. If TRUE, the trained models are stored locally. Default is FALSE.

directory

Path to a local directory in which to store the trained models.

...

Additional parameters to be passed to the autoML algorithm in the h2o package.

newdata

h2o frame (data.frame). The data.frame must already be uploaded to the h2o server (cloud). When specified, this dataset is used for evaluating the models. If not specified, model performance on the training dataset is reported.

family

Model family. Currently only "binary" classification models are supported.

strategy

Character. The currently available strategies are "search" (default) and "top". The "search" strategy searches for the best combination of top-performing diverse models. The "top" strategy is simpler: it combines the specified percentage of top-performing diverse models without examining whether including a larger number of models could further improve the ensemble. The "search" strategy is generally preferable, unless the computation runtime is too long for the search to be feasible.

model_selection_criteria

Character vector specifying the performance metrics to consider for model selection. The default is c("auc", "aucpr", "mcc", "f2"). Other possible criteria are "f1point5", "f3", "f4", "f5", "kappa", "mean_per_class_error", "gini", and "accuracy", which are also provided by the "evaluate" function.

min_improvement

Numeric. Minimum improvement in the model evaluation metric required for the optimization search to continue.

max

Integer. Maximum number of models to extract for each criterion. The default value is the "top_rank" percentage for each model selection criterion.

top_rank

Numeric vector. Percentages of the top models that should be selected. If the strategy is "search", the algorithm searches for the best combination of models, working from the top-ranked models to the bottom. If the strategy is "top", only the first value of the vector is used (by default, the top 1%).

stop_rounds

Integer. Number of stopping rounds, in case the model stops improving.

reset_stop_rounds

Logical. If TRUE, every time the model improves, the stopping-rounds penalty is reset to 0.

stop_metric

Character. Model stopping metric. The default is "auc", but "aucpr" and "mcc" are also available.

seed

Random seed (recommended).

verbatim

Logical. If TRUE, reports additional information about the progress of model training; mainly useful for debugging.

startH2O

Logical. If TRUE, the h2o server will be initiated.

nthreads