Version: 1.0.2
Date: 2025-04-09
Title: Machine Learning Immunogenicity and Vaccine Response Analysis
Author: Ivan Tomic [aut, cre, cph], Adriana Tomic [aut, ctb, cph, fnd], Stephanie Hao [aut]
Description: Used for analyzing immune responses and predicting vaccine efficacy using machine learning and advanced data processing techniques. 'Immunaut' integrates both unsupervised and supervised learning methods, managing outliers and capturing immune response variability. It performs multiple rounds of predictive model testing to identify robust immunogenicity signatures that can predict vaccine responsiveness. The platform is designed to handle high-dimensional immune data, enabling researchers to uncover immune predictors and refine personalized vaccination strategies across diverse populations.
Maintainer: Ivan Tomic <info@ivantomic.com>
Packaged: 2025-04-09 14:50:39 UTC; login
Imports: cluster, plyr, dplyr, caret, pROC, PRROC, stats, rlang, Rtsne, dbscan, FNN, igraph, fpc, mclust, ggplot2, grDevices, RColorBrewer, R.utils, clusterSim, parallel, doParallel
Depends: R (≥ 3.4.0)
URL: https://github.com/atomiclaboratory/immunaut, <https://atomic-lab.org>
BugReports: https://github.com/atomiclaboratory/immunaut/issues
License: GPL-3
Encoding: UTF-8
LazyLoad: yes
LazyData: yes
RoxygenNote: 7.3.2.9000
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2025-04-09 17:10:02 UTC

Automated Machine Learning Model Building

Description

This function automates the process of building machine learning models using the caret package. It supports both binary and multi-class classification and allows users to specify a list of machine learning algorithms to be trained on the dataset. The function splits the dataset into training and testing sets, applies preprocessing steps, and trains models using cross-validation. It computes performance metrics including the confusion matrix, AUROC (macro-averaged for multi-class problems), and prAUC (binary classification only).

Usage

auto_simon_ml(dataset_ml, settings)

Arguments

dataset_ml

A data frame containing the dataset for training. All columns except the outcome column should contain the features.

settings

A list containing the following parameters:

  • outcome: A string specifying the name of the outcome column in dataset_ml. Defaults to "immunaut" if not provided.

  • excludedColumns: A vector of column names to be excluded from the training data. Defaults to NULL.

  • preProcessDataset: A vector of preprocessing steps to be applied (e.g., c("center", "scale", "medianImpute")). Defaults to NULL.

  • selectedPartitionSplit: A numeric value specifying the proportion of data to be used for training. Must be between 0 and 1. Defaults to 0.7.

  • selectedPackages: A character vector specifying the machine learning algorithms to be used for training (e.g., "nb", "rpart"). Defaults to c("nb", "rpart").

Details

The function performs preprocessing (e.g., centering, scaling, and imputation of missing values) on the dataset based on the provided settings. It splits the data into training and testing sets using the specified partition, trains models using cross-validation, and computes performance metrics.
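
As a rough illustration of the partitioning step, the base R sketch below draws a stratified training sample at the proportion given by selectedPartitionSplit. The helper name stratified_split and the toy outcome vector are hypothetical; the package itself relies on caret, so this only mirrors the idea and is not the package's code.

```r
# Hypothetical helper: stratified train/test split in base R.
# Each outcome class contributes roughly proportion p of its rows to training.
stratified_split <- function(outcome, p = 0.7, seed = 1337) {
  set.seed(seed)
  idx_by_class <- split(seq_along(outcome), outcome)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- max(1, floor(length(idx) * p))
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  sort(train_idx)
}

# Toy outcome with two classes (made-up data).
outcome <- factor(rep(c("high", "low"), times = c(60, 40)))
train_idx <- stratified_split(outcome, p = 0.7)
test_idx <- setdiff(seq_along(outcome), train_idx)

# Class balance in the training portion.
table(outcome[train_idx])
```

Sampling within each class keeps the class proportions of the training and testing sets close to those of the full dataset.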

For binary classification problems, the function calculates AUROC and prAUC. For multi-class classification, it calculates macro-averaged AUROC, though prAUC is not used.

The function returns a list of trained models along with their performance metrics, including confusion matrix, variable importance, and post-resample metrics.
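
To make the macro-averaged AUROC mentioned above concrete, the following base R sketch computes a one-vs-rest AUROC per class and takes the unweighted mean. One-vs-rest averaging is only one common scheme and is an assumption here; the function names and the toy probabilities are hypothetical, and the package's exact computation may differ.

```r
# AUROC for one class versus the rest, via the rank-sum (Mann-Whitney) identity.
auc_binary <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Macro average: unweighted mean of the per-class one-vs-rest AUROCs.
macro_auc <- function(prob_matrix, truth) {
  classes <- colnames(prob_matrix)
  per_class <- vapply(classes, function(cl) {
    auc_binary(prob_matrix[, cl], truth == cl)
  }, numeric(1))
  mean(per_class)
}

# Toy predicted class probabilities for three clusters (made-up values).
truth <- c("1", "1", "2", "2", "3", "3")
probs <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.6, 0.3, 0.1),
  c(0.2, 0.5, 0.3),
  c(0.3, 0.4, 0.3),
  c(0.1, 0.2, 0.7),
  c(0.2, 0.3, 0.5)
)
colnames(probs) <- c("1", "2", "3")

macro_auc(probs, truth)
```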

Value

A list where each element corresponds to a trained model for one of the algorithms specified in settings$selectedPackages. Each element contains:

Examples

## Not run: 
dataset <- read.csv("fc_wo_noise.csv", header = TRUE, row.names = 1)

# Generate a file header for the dataset to use in downstream analysis
file_header <- generate_file_header(dataset)

settings <- list(
    fileHeader = file_header,
    # Columns selected for analysis
    selectedColumns = c("ExampleColumn1", "ExampleColumn2"), 
    clusterType = "Louvain",
    removeNA = TRUE,
    preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
    target_clusters_range = c(3,4),
    resolution_increments = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
    min_modularities = c(0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9),
    pickBestClusterMethod = "Modularity",
    seed = 1337
)

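result <- immunaut(dataset, settings)
dataset_ml <- result$dataset$original
# NOTE: tsne_clust and i are not defined in this snippet; they are assumed to
# come from an enclosing loop over clustering results that is not shown here.
dataset_ml$pandora_cluster <- tsne_clust[[i]]$info.norm$pandora_cluster
dataset_ml <- dplyr::rename(dataset_ml, immunaut = pandora_cluster)
dataset_ml <- dataset_ml[, c("immunaut", setdiff(names(dataset_ml), "immunaut"))]
# Assumption: no value is given for split, so the documented default is used
split <- 0.7
settings_ml <- list(
    excludedColumns = c("ExampleColumn0"),
    preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
    selectedPartitionSplit = split,  # Proportion of data used for training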
    selectedPackages = c("rf", "RRF", "RRFglobal", "rpart2", "c5.0", "sparseLDA", 
    "gcvEarth", "cforest", "gaussPRPoly", "monmlp", "slda", "spls"),
    trainingTimeout = 180  # Timeout 3 minutes
)
ml_results <- auto_simon_ml(dataset_ml, settings_ml)

## End(Not run)


Perform t-Distributed Stochastic Neighbor Embedding (t-SNE)

Description

The calculate_tsne function reduces high-dimensional data into a 2-dimensional space using t-SNE for visualization and analysis. This function dynamically adjusts t-SNE parameters based on the characteristics of the dataset, ensuring robust handling of edge cases. It also performs data validation, such as checking for sufficient data, removing zero variance columns, and adjusting perplexity for optimal performance.
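
The kind of validation described above can be sketched in base R as follows. Rtsne rejects a perplexity when three times its value exceeds the number of rows minus one, so a cap of that form is shown here. The helper names cap_perplexity and drop_zero_variance are hypothetical, and the rule actually used by calculate_tsne may differ.

```r
# Hypothetical helper: largest perplexity allowed for n_rows observations,
# assuming the constraint 3 * perplexity <= n_rows - 1.
cap_perplexity <- function(n_rows, requested = 30) {
  upper <- floor((n_rows - 1) / 3)
  if (upper < 1) stop("Not enough rows for t-SNE")
  min(requested, upper)
}

# Hypothetical helper: drop columns that hold a single distinct value.
drop_zero_variance <- function(x) {
  keep <- vapply(x, function(col) {
    length(unique(col[!is.na(col)])) > 1
  }, logical(1))
  x[, keep, drop = FALSE]
}

# Toy data: column b is constant and is removed.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5), c = c(0.1, 0.4, 0.2, 0.9))
names(drop_zero_variance(toy))
cap_perplexity(nrow(toy))
```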

Usage

calculate_tsne(dataset, settings, removeGroups = TRUE)

Arguments

dataset

A data frame or matrix containing the dataset to be processed. Must contain numeric columns.

settings

A list of settings for t-SNE, which may include fileHeader, groupingVariables, perplexity, max_iter, eta, theta, exaggeration_factor, and preProcessDataset.

removeGroups

Logical, indicating whether to remove grouping variables before performing t-SNE. Default is TRUE.

Value

A list containing:


Cast All Strings to NA

Description

This function processes the columns of a given dataset, converting all non-numeric string values (including factor columns converted to character) to NA. It excludes specified columns from this transformation. Columns that are numeric or of other types are left unchanged.

Usage

castAllStringsToNA(dataset, excludeColumns = c())

Arguments

dataset

A data frame containing the dataset to be processed.

excludeColumns

A character vector specifying the names of columns to be excluded from processing. These columns will not have any values converted to NA.

Details

The function iterates over every column not listed in excludeColumns, converts factors to character, and then attempts to convert character values to numeric. Any string that cannot be parsed as a number becomes NA. This is useful for cleaning datasets that contain mixed data types.

Value

A data frame where non-numeric strings in the included columns are replaced with NA, and all other columns remain unchanged.
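
A minimal base R sketch of the behaviour described above is given below. The function cast_strings_to_na and the toy data frame are hypothetical stand-ins written for illustration; they are not the package's implementation.

```r
# Hypothetical re-implementation of the idea, not the package's code.
cast_strings_to_na <- function(dataset, exclude_columns = character()) {
  for (col in setdiff(names(dataset), exclude_columns)) {
    values <- dataset[[col]]
    if (is.factor(values)) values <- as.character(values)
    if (is.character(values)) {
      # Strings that cannot be parsed as numbers become NA.
      dataset[[col]] <- suppressWarnings(as.numeric(values))
    }
  }
  dataset
}

df <- data.frame(
  id = c("s1", "s2", "s3"),
  titer = c("1.5", "n/a", "3"),
  age = c(30, 41, 27),
  stringsAsFactors = FALSE
)

cleaned <- cast_strings_to_na(df, exclude_columns = "id")
cleaned
```

The id column is left untouched because it is excluded, while the unparseable entry in titer is replaced with NA.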


Perform Density-Based Clustering on t-SNE Results Using DBSCAN

Description

This function applies Density-Based Spatial Clustering of Applications with Noise (DBSCAN) on t-SNE results to identify clusters and detect noise points. It dynamically calculates the MinPts and eps parameters based on the t-SNE results and settings provided. Additionally, the function computes silhouette scores to evaluate cluster quality and returns cluster centroids along with cluster sizes.

Usage

cluster_tsne_density(info.norm, tsne.norm, settings)

Arguments

info.norm

A data frame containing the normalized data on which the t-SNE analysis was carried out.

tsne.norm

The t-SNE results object, including the 2D t-SNE coordinates (Y matrix).

settings

A list of settings for the DBSCAN clustering. These settings include:

  • minPtsAdjustmentFactor: A factor to adjust the minimum number of points required to form a cluster (MinPts).

  • epsQuantile: The quantile used to determine the eps value for DBSCAN.

Details

The function first calculates MinPts based on the dimensionality of the t-SNE data and adjusts it using the provided minPtsAdjustmentFactor. The eps value is determined dynamically from the k-nearest neighbors distance using the quantile specified by epsQuantile. DBSCAN is then applied to the t-SNE data, and any NA values in the cluster assignments are replaced with a predefined outlier cluster ID (100). Finally, the function calculates cluster centroids, sizes, and silhouette scores to evaluate cluster separation and quality.
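
The parameter selection described above can be approximated in base R as shown below. The MinPts formula (dimensionality plus one, scaled by the adjustment factor) is an assumption made for this sketch, as are the variable names and the toy embedding; only the quantile-of-k-distance idea and the outlier ID 100 come from the text.

```r
set.seed(1)
# Toy 2D embedding standing in for the Y matrix of the t-SNE results.
Y <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

min_pts_adjustment_factor <- 1
eps_quantile <- 0.9

# Assumed rule: MinPts grows with dimensionality, scaled by the factor.
min_pts <- max(2, round((ncol(Y) + 1) * min_pts_adjustment_factor))

# Distance from every point to its MinPts-th nearest neighbour.
d <- as.matrix(dist(Y))
k_dist <- apply(d, 1, function(row) sort(row)[min_pts + 1])  # +1 skips self

# eps is the chosen quantile of those k-distances.
eps <- unname(quantile(k_dist, probs = eps_quantile))

# Missing cluster assignments are replaced with the outlier cluster ID.
clusters <- c(1, 2, NA, 1)
clusters[is.na(clusters)] <- 100
```

A higher epsQuantile yields a larger eps, so fewer points are flagged as noise.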

Value

A list containing:


Perform Hierarchical Clustering on t-SNE Results

Description

This function applies hierarchical clustering to t-SNE results, allowing for the identification of clusters in a reduced-dimensional space. The function also handles outliers by using DBSCAN for initial noise detection, and provides options to include or exclude outliers from the clustering process. Silhouette scores are computed to evaluate clustering quality, and cluster centroids are returned for visualization.

Usage

cluster_tsne_hierarchical(info.norm, tsne.norm, settings)

Arguments

info.norm

A data frame containing the normalized data on which the t-SNE analysis was carried out.

tsne.norm

The t-SNE results object, including the 2D t-SNE coordinates (Y matrix).

settings

A list of settings for the clustering analysis. The settings must include:

  • clustLinkage: The linkage method for hierarchical clustering (e.g., "ward.D2").

  • clustGroups: The number of groups (clusters) to cut the hierarchical tree into.

  • distMethod: The distance metric to be used (e.g., "euclidean").

  • minPtsAdjustmentFactor: A factor to adjust the minimum number of points required to form a cluster (MinPts).

  • epsQuantile: The quantile used to determine the eps value for DBSCAN.

  • excludeOutliers: A logical value indicating whether to exclude outliers detected by DBSCAN from hierarchical clustering.

  • pointSize: A numeric value used to adjust the placement of outlier centroids.

Details

The function first uses DBSCAN to detect outliers (marked as cluster "100") and then applies hierarchical clustering on the t-SNE results, either including or excluding the outliers depending on the settings. Silhouette scores are computed to assess the quality of the clustering. Cluster centroids are calculated and returned, along with the sizes of each cluster. Outliers, if detected, are handled separately in the final centroid calculation.
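
The hierarchical step alone, without the DBSCAN outlier pass, can be sketched with base R functions as follows. The toy coordinates are made up, and the settings values simply reuse the examples given in the argument list above.

```r
set.seed(42)
# Toy 2D coordinates with three well-separated groups (made-up data).
Y <- rbind(
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 0)),
  cbind(rnorm(30, mean = 8), rnorm(30, mean = 8)),
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 8))
)
colnames(Y) <- c("tsne1", "tsne2")

settings <- list(clustLinkage = "ward.D2", clustGroups = 3, distMethod = "euclidean")

# Build the tree and cut it into the requested number of groups.
tree <- hclust(dist(Y, method = settings$distMethod), method = settings$clustLinkage)
cluster_id <- cutree(tree, k = settings$clustGroups)

# Centroid and size of each cluster.
centroids <- aggregate(as.data.frame(Y), by = list(cluster = cluster_id), FUN = mean)
sizes <- table(cluster_id)
```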

Value

A list containing:


Perform KNN and Louvain Clustering on t-SNE Results

Description

This function performs clustering on t-SNE results by first applying K-Nearest Neighbors (KNN) to construct a graph, and then using the Louvain method for community detection. The function dynamically adjusts KNN parameters based on the size of the dataset, ensuring scalability. Additionally, it computes the silhouette score to evaluate cluster quality and calculates cluster centroids for visualization.

Usage

cluster_tsne_knn_louvain(
  info.norm,
  tsne.norm,
  settings,
  resolution_increment = 0.1,
  min_modularity = 0.5
)

Arguments

info.norm

A data frame containing the normalized data on which the t-SNE analysis was carried out.

tsne.norm

A list containing the t-SNE results, including a 2D t-SNE coordinate matrix in the Y element.

settings

A list of settings for the analysis, including:

  • knn_clusters: The number of nearest neighbors to use for KNN (default: 250).

  • target_clusters_range: A numeric vector specifying the target range for the number of clusters.

  • start_resolution: The starting resolution for Louvain clustering.

  • end_resolution: The maximum resolution to test.

  • min_modularity: The minimum acceptable modularity for valid clusterings.

resolution_increment

The step size for incrementing the Louvain clustering resolution. Defaults to 0.1.

min_modularity

The minimum modularity score allowed for a valid clustering. Defaults to 0.5.

Details

This function begins by constructing a KNN graph from the t-SNE results, then applies the Louvain algorithm for community detection. The KNN parameter is dynamically adjusted based on the size of the dataset to ensure scalability. The function evaluates clustering quality using silhouette scores and calculates cluster centroids for visualization. NA cluster assignments are handled by assigning them to a separate cluster labeled "100".
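
Only the graph-construction step is sketched here, in base R. The rule that caps k at the number of other points is an assumption, and the small knn_clusters value is chosen purely to keep the toy graph sparse. The resulting edge list could then be handed to a graph library such as igraph for Louvain community detection, which is not shown.

```r
set.seed(7)
Y <- matrix(rnorm(40), ncol = 2)  # toy stand-in for the t-SNE coordinates
knn_clusters <- 5                 # illustrative value, not the documented default

# Assumed adjustment: k can never exceed the number of other points.
k <- min(knn_clusters, nrow(Y) - 1)

# For each point, the indices of its k nearest neighbours (self excluded).
d <- as.matrix(dist(Y))
neighbours <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))

# Edge list with one row per (point, neighbour) pair.
edges <- cbind(
  from = rep(seq_len(nrow(Y)), each = k),
  to = as.vector(t(neighbours))
)
head(edges)
```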

Value

A list containing the following elements:


Apply Mclust Clustering on t-SNE Results

Description

This function performs Mclust clustering on the 2D t-SNE results, which are derived from high-dimensional data. It includes an initial outlier detection step using DBSCAN, and the user can specify whether to exclude outliers from the clustering process. Silhouette scores are computed to evaluate the quality of the clustering, and cluster centroids are returned for visualization, with outliers handled separately.

Usage

cluster_tsne_mclust(info.norm, tsne.norm, settings)

Arguments

info.norm

A data frame containing the normalized data on which the t-SNE analysis was carried out.

tsne.norm

The t-SNE results object, including the 2D t-SNE coordinates (Y matrix).

settings

A list of settings for the clustering analysis, including:

  • clustGroups: The number of groups (clusters) for Mclust to fit.

  • minPtsAdjustmentFactor: A factor to adjust the minimum number of points required to form a cluster (MinPts) in DBSCAN.

  • epsQuantile: The quantile used to determine the eps value for DBSCAN.

  • excludeOutliers: A logical value indicating whether to exclude outliers detected by DBSCAN from the Mclust clustering.

  • pointSize: A numeric value used to adjust the placement of outlier centroids.

Details

The function first uses DBSCAN to detect outliers (marked as cluster "100") and then applies Mclust clustering on the t-SNE results. Outliers can be either included or excluded from the clustering, depending on the settings. Silhouette scores are calculated to assess the quality of the clustering. Cluster centroids are returned, along with the sizes of each cluster, and outliers are handled separately in the centroid calculation.

Value

A list containing:


Find Optimal Resolution for Louvain Clustering

Description

This function iterates over a range of resolution values to find the optimal resolution for Louvain clustering, balancing the number of clusters and modularity. It aims to identify a resolution that results in a reasonable number of clusters while maintaining a high modularity score.

Usage

find_optimal_resolution(
  graph,
  start_resolution = 0.1,
  end_resolution = 10,
  resolution_increment = 0.1,
  min_modularity = 0.3,
  target_clusters_range = c(3, 6)
)

Arguments

graph

An igraph object representing the graph to be clustered.

start_resolution

Numeric. The starting resolution for the Louvain algorithm. Default is 0.1.

end_resolution

Numeric. The maximum resolution to test. Default is 10.

resolution_increment

Numeric. The increment to adjust the resolution at each step. Default is 0.1.

min_modularity

Numeric. The minimum acceptable modularity for valid clusterings. Default is 0.3.

target_clusters_range

Numeric vector of length 2. Specifies the acceptable range for the number of clusters (inclusive). Default is c(3, 6).

Details

The function performs Louvain clustering at different resolutions, starting from start_resolution and ending at end_resolution, incrementing by resolution_increment at each step. At each resolution, the function calculates the number of clusters and modularity. The results are filtered to select those where modularity exceeds min_modularity and the number of clusters falls within the specified range target_clusters_range. The optimal resolution is chosen based on the most frequent number of clusters and the median resolution that satisfies these criteria.
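
The selection logic described above can be sketched on a precomputed table of scan results, as in the base R example below. The function name select_resolution, the column names, and the values are hypothetical; choosing the row closest to the median resolution is one reading of the description, not a confirmed detail of the implementation.

```r
# Hypothetical scan results, one row per tested resolution.
all_results <- data.frame(
  resolution = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7),
  n_clusters = c(2, 3, 4, 4, 4, 5, 8),
  modularity = c(0.60, 0.55, 0.52, 0.50, 0.47, 0.41, 0.25)
)

select_resolution <- function(all_results, min_modularity = 0.3,
                              target_clusters_range = c(3, 6)) {
  # Keep rows that pass the modularity and cluster-count filters.
  valid <- all_results[
    all_results$modularity > min_modularity &
      all_results$n_clusters >= target_clusters_range[1] &
      all_results$n_clusters <= target_clusters_range[2], , drop = FALSE]
  if (nrow(valid) == 0) stop("No resolution satisfies the criteria")

  # Most frequent number of clusters among the valid rows.
  counts <- table(valid$n_clusters)
  most_frequent <- as.numeric(names(counts)[which.max(counts)])
  frequent <- valid[valid$n_clusters == most_frequent, , drop = FALSE]

  # Take the row whose resolution is closest to the median of the candidates.
  pick <- which.min(abs(frequent$resolution - median(frequent$resolution)))
  list(
    selected = as.list(frequent[pick, ]),
    frequent_clusters_results = frequent,
    all_results = all_results
  )
}

res <- select_resolution(all_results)
res$selected
```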

Value

A list containing:

selected

A list with the optimal resolution, best modularity, and number of clusters.

frequent_clusters_results

A data frame containing results for resolutions that yielded the most frequent number of clusters.

all_results

A data frame with the resolution, number of clusters, and modularity for all tested resolutions.


Generate a Demo Dataset with Specified Number of Clusters and Overlap

Description

This function generates a demo dataset with a specified number of subjects, features, and desired number of clusters, ensuring that the generated clusters are not too far apart and have some degree of overlap to simulate real-world data. The generated dataset includes demographic information (outcome, age, and gender), as well as numeric features with a specified probability of missing values.

Usage

generate_demo_data(
  n_subjects = 1000,
  n_features = 200,
  missing_prob = 0.1,
  desired_number_clusters = 3,
  cluster_overlap_sd = 15
)

Arguments

n_subjects

Integer. The number of subjects (rows) to generate. Defaults to 1000.

n_features

Integer. The number of features (columns) to generate. Defaults to 200.

missing_prob

Numeric. The probability of introducing missing values (NA) in the feature columns. Defaults to 0.1.

desired_number_clusters

Integer. The approximate number of clusters to generate in the feature space. Defaults to 3.

cluster_overlap_sd

Numeric. The standard deviation to control cluster overlap. Defaults to 15 for more overlap.

Details

The function generates n_features numeric columns based on Gaussian clusters with some overlap between clusters to simulate more realistic data. Missing values are introduced in each feature column based on the missing_prob.
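
A simplified base R sketch of this kind of generator is shown below. The function simulate_clusters, the range used for the cluster centres, and the column naming are assumptions made for illustration; the demographic columns are omitted and generate_demo_data may construct its clusters differently.

```r
simulate_clusters <- function(n_subjects = 100, n_features = 5,
                              n_clusters = 3, overlap_sd = 15,
                              missing_prob = 0.1, seed = 1337) {
  set.seed(seed)
  cluster <- sample(seq_len(n_clusters), n_subjects, replace = TRUE)
  # One centre per cluster and feature; the spread of centres is an assumption.
  centres <- matrix(runif(n_clusters * n_features, min = 0, max = 100),
                    nrow = n_clusters)
  # Gaussian noise around each subject's cluster centre controls the overlap.
  features <- centres[cluster, , drop = FALSE] +
    matrix(rnorm(n_subjects * n_features, sd = overlap_sd), nrow = n_subjects)
  # Blank out each value independently with probability missing_prob.
  features[runif(length(features)) < missing_prob] <- NA
  out <- as.data.frame(features)
  names(out) <- paste0("feature", seq_len(n_features))
  out
}

toy <- simulate_clusters()
dim(toy)
mean(is.na(as.matrix(toy)))
```

A larger overlap_sd relative to the distance between centres makes the clusters harder to separate.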

Value

A data frame containing the generated demo dataset, with columns:

Examples


# Generate a demo dataset with 1000 subjects, 200 features, and 3 clusters
demo_data <- generate_demo_data(n_subjects = 1000, n_features = 200, 
                                desired_number_clusters = 3, 
                                cluster_overlap_sd = 15, missing_prob = 0.1)

# View the first few rows of the dataset
head(demo_data)



Generate a File Header

Description

This function generates a fileHeader object from a given data frame which includes original names and remapped names of the data frame columns.

Usage

generate_file_header(dataset)

Arguments

dataset

The input data frame.

Value

A data frame containing original and remapped column names.


Main function to carry out Immunaut Analysis

Description

This function performs clustering and dimensionality reduction analysis on a dataset using user-defined settings. It handles various preprocessing steps, dimensionality reduction via t-SNE, multiple clustering methods, and generates associated plots based on user-defined or default settings.

Usage

immunaut(dataset, settings = list())

Arguments

dataset

A data frame representing the dataset on which the analysis will be performed. The dataset must contain numeric columns for dimensionality reduction and clustering.

settings

A named list containing settings for the analysis. If NULL, defaults will be used. The settings list may contain:

fileHeader

A data frame mapping the original column names to remapped column names. Used for t-SNE input preparation.

selectedColumns

Character vector of columns to be used for the analysis. Defaults to NULL.

cutOffColumnSize

Numeric. The maximum size of the dataset in terms of columns. Defaults to 50,000.

excludedColumns

Character vector of columns to exclude from the analysis. Defaults to NULL.

groupingVariables

Character vector of columns to use for grouping the data during analysis. Defaults to NULL.

colorVariables

Character vector of columns to use for coloring in the plots. Defaults to NULL.

preProcessDataset

Character vector of preprocessing methods to apply (e.g., scaling, normalization). Defaults to NULL.

fontSize

Numeric. Font size for plots. Defaults to 12.

pointSize

Numeric. Size of points in plots. Defaults to 1.5.

theme

Character. The ggplot2 theme to use (e.g., "theme_gray"). Defaults to "theme_gray".

colorPalette

Character. Color palette for plots (e.g., "RdPu"). Defaults to "RdPu".

aspect_ratio

Numeric. The aspect ratio of plots. Defaults to 1.

clusterType

Character. The clustering method to use. Options are "Louvain", "Hierarchical", "Mclust", "Density". Defaults to "Louvain".

removeNA

Logical. Whether to remove rows with NA values. Defaults to FALSE.

datasetAnalysisGrouped

Logical. Whether to perform grouped dataset analysis. Defaults to FALSE.

plot_size

Numeric. The size of the plot. Defaults to 12.

knn_clusters

Numeric. The number of nearest neighbors used for KNN-based clustering. Defaults to 250.

perplexity

Numeric. The perplexity parameter for t-SNE. Defaults to NULL (automatically determined).

exaggeration_factor

Numeric. The exaggeration factor for t-SNE. Defaults to NULL.

max_iter

Numeric. The maximum number of iterations for t-SNE. Defaults to NULL.

theta

Numeric. The Barnes-Hut approximation parameter for t-SNE. Defaults to NULL.

eta

Numeric. The learning rate for t-SNE. Defaults to NULL.

clustLinkage

Character. Linkage method for hierarchical clustering. Defaults to "ward.D2".

clustGroups

Numeric. The number of groups for hierarchical clustering. Defaults to 9.

distMethod

Character. Distance metric for clustering. Defaults to "euclidean".

minPtsAdjustmentFactor

Numeric. Adjustment factor for the minimum points in DBSCAN clustering. Defaults to 1.

epsQuantile

Numeric. Quantile to compute the epsilon parameter for DBSCAN clustering. Defaults to 0.9.

assignOutliers

Logical. Whether to assign outliers in the clustering step. Defaults to TRUE.

excludeOutliers

Logical. Whether to exclude outliers from clustering. Defaults to TRUE.

legendPosition

Character. Position of the legend in plots (e.g., "right", "bottom"). Defaults to "right".

datasetAnalysisClustLinkage

Character. Linkage method for dataset-level analysis. Defaults to "ward.D2".

datasetAnalysisType

Character. Type of dataset analysis (e.g., "heatmap"). Defaults to "heatmap".

datasetAnalysisRemoveOutliersDownstream

Logical. Whether to remove outliers during downstream dataset analysis (e.g., machine learning). Defaults to FALSE.

datasetAnalysisSortColumn

Character. The column used to sort dataset analysis results. Defaults to "cluster".

datasetAnalysisClustOrdering

Numeric. The order of clusters for analysis. Defaults to 1.

anyNAValues

Logical. Whether the dataset contains NA values. Defaults to FALSE.

categoricalVariables

Logical. Whether the dataset contains categorical variables. Defaults to FALSE.

resolution_increments

Numeric vector. The resolution increments to be used for Louvain clustering. Defaults to c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5).

min_modularities

Numeric vector. The minimum modularities to test for clustering. Defaults to c(0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9).

target_clusters_range

Numeric vector. The range of acceptable clusters to identify. Defaults to c(3, 6).

pickBestClusterMethod

Character. The method to use for picking the best clustering result ("Modularity", "Silhouette", or "SIMON"). Defaults to "Modularity".

weights

List. Weights for evaluating clusters based on AUROC, modularity, and silhouette. Defaults to list(AUROC = 0.5, modularity = 0.3, silhouette = 0.2). These weights are applied to help choose the most relevant clusters based on user goals:

AUROC

Weight for predictive performance (area under the receiver operating characteristic curve). Prioritize this when predictive accuracy is the main goal. For predictive analysis, a recommended configuration could be list(AUROC = 0.8, modularity = 0.1, silhouette = 0.1).

modularity

Weight for modularity score, which indicates the strength of clustering. Higher modularity suggests that clusters are well-separated. To prioritize well-separated clusters, use a configuration like list(AUROC = 0.4, modularity = 0.4, silhouette = 0.2).

silhouette

Weight for silhouette score, a measure of cohesion within clusters. Useful when cluster cohesion and interpretability are desired. For balanced clusters, a suggested configuration is list(AUROC = 0.4, modularity = 0.3, silhouette = 0.3).

Value

A list containing the following:

Examples


  data <- matrix(runif(2000), ncol=20)
  settings <- list(clusterType = "Louvain", 
  resolution_increments = c(0.05, 0.1), 
  min_modularities = c(0.3, 0.5))
  result <- immunaut(data.frame(data), settings)
  print(result$clusters)



Demo data set from the immunaut package. This data is used in this package's examples. It consists of a 4x4 feature matrix plus additional dummy columns that can be used for testing.

Description

Demo data set from the immunaut package. This data is used in this package's examples. It consists of a 4x4 feature matrix plus additional dummy columns that can be used for testing.

Usage

data(immunautDemo)

Format

An object of class data.frame with 4 rows and 7 columns.

Examples

## Not run: 
	data(immunautDemo)
	## define settings variable
	settings <- list()
	settings$fileHeader <- generate_file_header(immunautDemo)
	# ... and other settings
	results <- immunaut(immunautDemo, settings)

## End(Not run)


Demo data set from the immunaut package. This data is used in this package's examples.

Description

Demo data set from the immunaut package. This data is used in this package's examples.

Usage

data(immunautLAIV)

Format

An object of class data.frame with 244 rows and 32 columns.

Examples

## Not run: 
	data(immunautLAIV)
	## define settings variable
	settings <- list()
	settings$fileHeader <- generate_file_header(immunautLAIV)
	# ... and other settings
	results <- immunaut(immunautLAIV, settings)

## End(Not run)


Is Numeric

Description

Determines whether a variable is a number or a numeric string.

Usage

isNumeric(x)

Arguments

x

Variable to be checked

Value

Logical indicating whether x is numeric and non-NA
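
One way to express such a check in base R is sketched below. The function is_numeric_like is a hypothetical equivalent written for illustration and may not match isNumeric in every edge case.

```r
# Hypothetical equivalent of the check described above:
# TRUE when the value parses as a number, FALSE for NA or non-numeric strings.
is_numeric_like <- function(x) {
  !is.na(suppressWarnings(as.numeric(as.character(x))))
}

is_numeric_like(c("3.14", "abc", NA, "1e-3"))
is_numeric_like(42)
```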


Check if request variable is Empty

Description

Checks if the given variable is empty and optionally logs the variable name.

Usage

is_var_empty(variable)

Arguments

variable

The variable to check.

Value

Logical. TRUE if the variable is considered empty, FALSE otherwise.


Pick Best Cluster by Modularity

Description

This function selects the best cluster from a list of clustering results based on the highest modularity score.

Usage

pick_best_cluster_modularity(tsne_clust)

Arguments

tsne_clust

A list of clustering results where each element contains clustering information, including the modularity score.

Details

The function iterates over a list of clustering results (tsne_clust) and selects the result with the highest modularity score. If no results are valid or the tsne_clust list is empty, the function stops with an error.
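
The selection can be sketched in base R as below. The helper pick_by_modularity and the assumption that each element stores its score in a field named modularity are hypothetical; the real result objects carry more information.

```r
# Hypothetical helper: return the element with the highest modularity.
pick_by_modularity <- function(results) {
  if (length(results) == 0) stop("No clustering results supplied")
  scores <- vapply(results, function(r) {
    if (is.null(r$modularity)) NA_real_ else as.numeric(r$modularity)
  }, numeric(1))
  if (all(is.na(scores))) stop("No valid clustering results")
  results[[which.max(scores)]]  # which.max ignores NA scores
}

# Made-up candidates; the third has no usable score.
candidates <- list(
  list(label = "a", modularity = 0.42),
  list(label = "b", modularity = 0.71),
  list(label = "c", modularity = NA)
)

best <- pick_by_modularity(candidates)
best$label
```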

Value

Returns the clustering result with the highest modularity score.


Pick the Best Clustering Result Based on Multiple Metrics

Description

This function evaluates multiple clustering results based on various metrics such as modularity, silhouette score, Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CH). It normalizes the scores across all metrics, calculates a combined score for each clustering result, and selects the best clustering result.

Usage

pick_best_cluster_overall(tsne_clust, tsne_calc)

Arguments

tsne_clust

A list of clustering results. Each result should contain metrics such as modularity, silhouette score, and cluster assignments for the dataset.

tsne_calc

A list containing the t-SNE results. It includes the t-SNE coordinates of the dataset used for clustering.

Details

The function computes four metrics for each clustering result: modularity, silhouette score, Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CH).

The scores for each metric are normalized between 0 and 1, and an overall score is calculated for each clustering result. The clustering result with the highest overall score is selected as the best.
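
A small worked example of the normalization and combination step is given below in base R. The metric values are made up, and two details are assumptions of this sketch: the Davies-Bouldin Index is inverted because lower values indicate better clustering, and the four normalized metrics are weighted equally.

```r
# Min-max normalization to the [0, 1] range.
min_max <- function(x) {
  rng <- range(x)
  if (diff(rng) == 0) return(rep(0.5, length(x)))  # all equal: neutral score
  (x - rng[1]) / diff(rng)
}

# Made-up metric values for three clustering results.
metrics <- data.frame(
  modularity = c(0.40, 0.65, 0.55),
  silhouette = c(0.20, 0.35, 0.50),
  dbi        = c(1.80, 0.90, 1.20),  # Davies-Bouldin: lower is better
  ch         = c(120, 310, 260)      # Calinski-Harabasz: higher is better
)

normalised <- data.frame(
  modularity = min_max(metrics$modularity),
  silhouette = min_max(metrics$silhouette),
  dbi        = 1 - min_max(metrics$dbi),  # flip so that higher is better
  ch         = min_max(metrics$ch)
)

overall <- rowMeans(normalised)  # equal weighting is an assumption
best <- which.max(overall)
```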

Value

The clustering result with the highest combined score based on modularity, silhouette score, Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CH).


Pick Best Cluster by Silhouette Score

Description

This function selects the best cluster from a list of clustering results based on the highest average silhouette score.

Usage

pick_best_cluster_silhouette(tsne_clust)

Arguments

tsne_clust

A list of clustering results where each element contains clustering information, including the average silhouette score.

Details

The function iterates over a list of clustering results (tsne_clust) and selects the result with the highest average silhouette score. If no results are valid or the tsne_clust list is empty, the function stops with an error.

Value

Returns the clustering result with the highest average silhouette score.


Select the Best Clustering Based on Weighted Scores: AUROC, Modularity, and Silhouette

Description

This function selects the optimal clustering configuration from a list of t-SNE clustering results by evaluating each configuration's AUROC, modularity, and silhouette scores. These scores are combined using a weighted average, allowing for a more comprehensive assessment of each configuration's relevance.

Usage

pick_best_cluster_simon(dataset, tsne_clust, tsne_calc, settings)

Arguments

dataset

A data frame representing the original dataset, where each observation will be assigned cluster labels from each clustering configuration in tsne_clust.

tsne_clust

A list of clustering results from different t-SNE configurations, with each element containing pandora_cluster assignments and clustering information.

tsne_calc

An object containing t-SNE results on dataset.

settings

A list of settings for machine learning model training and scoring, including:

excludedColumns

A character vector of columns to exclude from the analysis.

preProcessDataset

A character vector of preprocessing steps (e.g., scaling, centering).

selectedPartitionSplit

Numeric; the partition split ratio for train/test splits.

selectedPackages

Character vector of machine learning models to train.

trainingTimeout

Numeric; time limit (in seconds) for training each model.

weights

A list of weights for scoring criteria: weights$AUROC, weights$modularity, and weights$silhouette (default is 0.4, 0.3, and 0.3 respectively).

Details

For each clustering configuration in tsne_clust, this function:

  1. Assigns cluster labels to the dataset.

  2. Trains machine learning models specified in settings on the dataset with cluster labels.

  3. Scores each configuration using the AUROC of the trained models together with the modularity and silhouette scores of the clustering.

  4. Selects the clustering configuration with the highest weighted average score as the best clustering result.
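
The weighted scoring in the final step can be illustrated with the base R sketch below. The scores are made up and assumed to already be on a comparable 0 to 1 scale; the weights are the defaults documented for this function.

```r
weights <- list(AUROC = 0.4, modularity = 0.3, silhouette = 0.3)

# Made-up scores for three clustering configurations.
scores <- data.frame(
  AUROC      = c(0.81, 0.92, 0.75),
  modularity = c(0.60, 0.45, 0.70),
  silhouette = c(0.35, 0.30, 0.50)
)

# Align the weights with the score columns and normalise them to sum to 1.
w <- unlist(weights)[names(scores)]
weighted <- as.numeric(as.matrix(scores) %*% (w / sum(w)))

# The configuration with the highest weighted average is selected.
best <- which.max(weighted)
```

Raising the AUROC weight favours configurations whose clusters are easier to predict, while raising the other two favours cleaner cluster structure.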

Value

A list containing the best clustering configuration (with the highest weighted score) and its associated information.


Plot Clustered t-SNE Results

Description

This function generates a t-SNE plot with cluster assignments using consistent color mappings. It includes options for plotting points based on their t-SNE coordinates and adding cluster labels at the cluster centroids. The plot is saved as an SVG file in a temporary directory.

Usage

plot_clustered_tsne(info.norm, cluster_data, settings)

Arguments

info.norm

A data frame containing t-SNE coordinates (tsne1, tsne2) and cluster assignments (pandora_cluster) for each point.

cluster_data

A data frame containing the cluster centroids and labels, with columns tsne1, tsne2, label, and pandora_cluster.

settings

A list of settings for the plot, including:

  • theme: The ggplot2 theme to use (e.g., "theme_classic").

  • colorPalette: The color palette to use for clusters (e.g., "RdPu").

  • pointSize: The size of points in the plot.

  • fontSize: The font size used in the plot.

  • legendPosition: The position of the legend (e.g., "right").

  • plot_size: The size of the plot.

  • aspect_ratio: The aspect ratio of the plot.

Value

A ggplot2 object representing the clustered t-SNE plot.

Examples

## Not run: 
# Example usage
plot <- plot_clustered_tsne(info.norm, cluster_data, settings)
print(plot)

## End(Not run)
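
The example above assumes that info.norm, cluster_data, and settings already exist. One way to build inputs with the documented columns is sketched below. The coordinates are simulated, the centroids are taken as per-cluster means, and the numeric values chosen for pointSize, fontSize, plot_size, and aspect_ratio are assumptions, since their expected types and ranges are not stated here.

```r
set.seed(1)

# Simulated t-SNE coordinates for two well-separated clusters (toy data)
info.norm <- data.frame(
  tsne1 = c(rnorm(10, mean = -5), rnorm(10, mean = 5)),
  tsne2 = c(rnorm(10, mean = -5), rnorm(10, mean = 5)),
  pandora_cluster = rep(c(1, 2), each = 10)
)

# Cluster centroids: mean t-SNE coordinates per cluster
cluster_data <- aggregate(cbind(tsne1, tsne2) ~ pandora_cluster,
                          data = info.norm, FUN = mean)
cluster_data$label <- paste("Cluster", cluster_data$pandora_cluster)

# Plot settings; numeric values below are assumed, not documented defaults
settings <- list(
  theme = "theme_classic",
  colorPalette = "RdPu",
  pointSize = 1.5,
  fontSize = 12,
  legendPosition = "right",
  plot_size = 12,
  aspect_ratio = 1
)

# With the package loaded, these could then be passed as in the example:
# plot <- plot_clustered_tsne(info.norm, cluster_data, settings)
```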

Preprocess a Dataset Using Specified Methods

Description

This function preprocesses a dataset by applying transformation methods such as centering, scaling, imputation, or dimensionality reduction. Users can specify columns to exclude from preprocessing, and the function applies the selected methods in the correct order.

Usage

preProcessData(
  data,
  outcome,
  excludeClasses,
  methods = c("center", "scale"),
  settings
)

Arguments

data

A data frame or matrix representing the dataset to be preprocessed.

outcome

A character string representing the outcome variable, if any, for outcome-based transformations.

excludeClasses

A character vector specifying the column names to exclude from preprocessing. Default is NULL, meaning all columns are included in the preprocessing.

methods

A character vector specifying the preprocessing methods to apply. Default methods are c("center", "scale"). Available methods include:

  • "medianImpute": Impute missing values with the median.

  • "bagImpute": Impute missing values using bootstrap aggregation.

  • "knnImpute": Impute missing values using k-nearest neighbors.

  • "center": Subtract the mean from each feature.

  • "scale": Divide features by their standard deviation.

  • "pca": Principal Component Analysis for dimensionality reduction.

  • Other methods such as "BoxCox", "YeoJohnson", "range", etc.

settings

A named list containing settings for the analysis. If NULL, defaults will be used. The settings list may contain:

  • seed: An integer seed value for reproducibility.

Details

The function applies various transformations to the dataset as specified by the user. It ensures that methods are applied in the correct order to maintain data integrity and consistency. If fewer than two columns remain after excluding specified columns, the function halts and returns NULL. The function also handles categorical columns by skipping their transformation. Users can also specify outcome variables for specialized preprocessing.
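
The behavior described above (column exclusion, the early exit when fewer than two columns remain, and the skipping of categorical columns) can be illustrated for the default "center" and "scale" methods in base R. This is a simplified sketch, not the package's implementation: the helper center_scale_sketch is hypothetical, and it ignores the outcome and settings arguments and all other methods.

```r
# Hypothetical helper: centers and scales numeric columns that are not excluded
center_scale_sketch <- function(data, excludeClasses = NULL) {
  # Columns remaining after exclusion
  candidates <- setdiff(names(data), excludeClasses)

  # Mirror the documented early exit when fewer than two columns remain
  if (length(candidates) < 2) {
    return(NULL)
  }

  # Skip categorical (non-numeric) columns
  is_num <- vapply(data[candidates], is.numeric, logical(1))
  numeric_cols <- candidates[is_num]

  # "center": subtract the mean; "scale": divide by the standard deviation
  data[numeric_cols] <- lapply(data[numeric_cols], function(x) {
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  })
  data
}

df <- data.frame(
  id = 1:4,
  a  = c(1, 2, 3, 4),
  b  = c(10, 20, 30, 40),
  g  = c("x", "y", "x", "y")
)

out <- center_scale_sketch(df, excludeClasses = "id")
```

In this sketch the excluded id column and the categorical g column are returned unchanged, while a and b end up with mean 0 and standard deviation 1.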

Value

A list containing:


Pre-process and Resample Dataset

Description

This function applies pre-processing transformations to the dataset, then resamples it.

Usage

preProcessResample(
  datasetData,
  preProcess,
  selectedOutcomeColumns,
  outcome_and_classes,
  settings
)

Arguments

datasetData

A data frame to be pre-processed.

preProcess

A vector of pre-processing methods to apply.

selectedOutcomeColumns

A character vector of outcome columns.

outcome_and_classes

A list of outcomes and their classes.

settings

A named list containing settings for the analysis. If NULL, defaults will be used. The settings list may contain:

  • seed: An integer seed value for reproducibility.

Value

A list containing the pre-processing mapping and the processed dataset.


Remove Outliers Based on Cluster Information

Description

The remove_outliers function removes rows from a dataset based on the presence of outliers marked by a specific cluster ID (typically 100) in the pandora_cluster column. This function is meant to be used internally during downstream dataset analysis to filter out data points that have been identified as outliers during clustering.

Usage

remove_outliers(dataset, settings)

Arguments

dataset

A data frame that includes clustering results, particularly a pandora_cluster column.

settings

A list of settings. Must contain the logical value datasetAnalysisRemoveOutliersDownstream. If datasetAnalysisRemoveOutliersDownstream is TRUE, outliers (rows where pandora_cluster == 100) will be removed from the dataset.

Value

A filtered data frame with outliers removed if applicable.
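
The documented filtering rule can be sketched in base R as shown below. The function remove_outliers_sketch is a hypothetical stand-in written for illustration, not the package's code; the check that the pandora_cluster column exists is an added assumption.

```r
# Hypothetical stand-in for the documented behavior of remove_outliers()
remove_outliers_sketch <- function(dataset, settings) {
  remove <- isTRUE(settings$datasetAnalysisRemoveOutliersDownstream)

  # Drop rows flagged as outliers (cluster ID 100) only when requested
  if (remove && "pandora_cluster" %in% names(dataset)) {
    dataset <- dataset[dataset$pandora_cluster != 100, , drop = FALSE]
  }
  dataset
}

# Toy dataset: two rows carry the outlier cluster ID 100
dataset <- data.frame(
  value = c(0.1, 0.5, 9.9, 0.3, 12.4),
  pandora_cluster = c(1, 2, 100, 1, 100)
)

filtered <- remove_outliers_sketch(
  dataset,
  list(datasetAnalysisRemoveOutliersDownstream = TRUE)
)
unchanged <- remove_outliers_sketch(
  dataset,
  list(datasetAnalysisRemoveOutliersDownstream = FALSE)
)
```

Here filtered keeps the three rows assigned to clusters 1 and 2, and unchanged is identical to the input because removal was not requested.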