Type: | Package |
Title: | Collection of Machine Learning Datasets for Supervised Machine Learning |
Version: | 1.0.1 |
Maintainer: | Gary Hutson <hutsons-hacks@outlook.com> |
Description: | Contains a collection of datasets for working with machine learning tasks. It will contain datasets for supervised machine learning Jiang (2020)<doi:10.1016/j.beth.2020.05.002> and will include datasets for classification and regression. The aim of this package is to use data generated around health and other domains. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
BugReports: | https://github.com/StatsGary/MLDataR/issues |
Imports: | ConfusionTableR, dplyr, parsnip, rsample, recipes, workflows, ranger, caret, varhandle, OddsPlotty, ggplot2 |
RoxygenNote: | 7.1.2 |
Suggests: | rmarkdown, knitr |
VignetteBuilder: | knitr |
Depends: | R (≥ 2.10) |
NeedsCompilation: | no |
Packaged: | 2022-10-03 14:44:47 UTC; garyh |
Author: | Gary Hutson |
Repository: | CRAN |
Date/Publication: | 2022-10-03 15:10:02 UTC |
PreDiabetes dataset
Description
PreDiabetes dataset
Usage
PreDiabetes
Format
A data frame with 3059 rows and 9 variables:
- Age
age of the patient presenting with diabetes
- Sex
sex of the patient with diabetes
- IMD_Decile
Index of Multiple Deprivation Decile
- BMI
Body Mass Index of patient
- Age_PreDiabetes
age at pre diabetes diagnosis
- HbA1C
average blood glucose mmol/mol
- Time_Pre_To_Diabetes
time in years between pre-diabetes and diabetes diagnosis
- Age_Diabetes
age at diabetes diagnosis
- PreDiabetes_Checks_Before_Diabetes
number of pre-diabetes related primary care appointments before diabetes diagnosis
Source
Generated by Asif Laldin a.laldin@nhs.net, Jan-2022
Examples
library(dplyr)
data(PreDiabetes)
# Inspect the structure of the PreDiabetes data
diabetes_data <- PreDiabetes %>%
glimpse()
Care Home Incidents
Description
An NHS patient safety incidents dataset (https://www.england.nhs.uk/patient-safety/report-patient-safety-incident/) that has been synthetically generated against real data.
Usage
care_home_incidents
Format
A data frame with 1216 rows and 12 variables:
- CareHomeFail
a binary indicator to specify whether a certain care home is failing
- WeightLoss
aggregation of incidents indicating weight loss in patient
- Medication
aggregation of missed medication incidents
- Falls
Recorded number of patient falls
- Choking
Number of patient choking incidents
- UnexpectedDeaths
unexpected deaths in the care home
- Bruising
Number of bruising incidents in the care home
- Absconsion
Absconding from the care home setting
- ResidentAbuseByResident
Abuse conducted by one care home resident against another
- ResidentAbuseByStaff
Incidents of resident abuse by staff
- ResidentAbuseOnStaff
Incidents of residents abusing staff
- Wounds
Unexplained wounds against staff
Source
Collected by Gary Hutson hutsons-hacks@outlook.com, Jan-2022
Examples
library(dplyr)
data(care_home_incidents)
# Convert the CareHomeFail label to a factor
ch_incs <- care_home_incidents %>%
mutate(CareHomeFail = as.factor(CareHomeFail))
ch_incs %>% glimpse()
# Check that the label is now a factor
is.factor(ch_incs$CareHomeFail)
csgo
Description
csgo
Usage
csgo
Format
A data frame with 1,133 rows and 17 variables:
- map
Map on which the match was played
- day
Day of the month
- month
Month of the year
- year
Year
- date
Date of match DD/MM/YYYY
- wait_time_s
Time waited to find match
- match_time_s
Total match length in seconds
- team_a_rounds
Number of rounds played as Team A
- team_b_rounds
Number of rounds played as Team B
- ping
Maximum ping in milliseconds; the ping is the signal that's sent from one computer to another on the same network
- kills
Number of kills accumulated in match; max 5 per round
- assists
Number of assists accumulated in the match; an assist is awarded for inflicting more than 50 percent damage on an opponent who is then killed by another player; max 5 per round
- deaths
Number of times the player died during the match; max 1 per round
- mvps
Most Valuable Player award
- hs_percent
Percentage of kills that resulted from a shot to the opponent's head
- points
Number of points accumulated during the match. Points are gained from kills, assists, bomb defuses and bomb plants. Points are lost for suicide and friendly kills
- result
The result of the match, Win, Loss, Draw
Source
Extracted by Asif Laldin a.laldin@nhs.net, March-2019
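The csgo entry ships without an example, so the following is a minimal sketch rather than the package author's own usage. It assumes the `kills`, `deaths` and `result` columns are as documented above; the helper name `summarise_kd` is hypothetical and not part of MLDataR.

```r
# Hypothetical helper (not part of MLDataR): mean kills/deaths ratio per
# match result. Assumes columns `kills`, `deaths` and `result` as documented.
summarise_kd <- function(df) {
  # Treat zero deaths as one to avoid dividing by zero
  df$kd_ratio <- df$kills / pmax(df$deaths, 1)
  # One row per match result with the mean ratio
  aggregate(kd_ratio ~ result, data = df, FUN = mean)
}

# Run against the packaged data only if MLDataR is installed
if (requireNamespace("MLDataR", quietly = TRUE)) {
  data("csgo", package = "MLDataR")
  print(summarise_kd(csgo))
}
```

Clamping deaths to a minimum of one is an illustrative choice; matches with zero deaths could equally be excluded or reported separately.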
Diabetes datasets
Description
Diabetes datasets
Usage
diabetes_data
Format
A data frame with 520 rows and 17 variables:
- Age
age of the patient presenting with diabetes
- Gender
gender of the patient with diabetes
- ExcessUrination
if the patient has a history of excessive urination
- Polydipsia
abnormal thirst, accompanied by the excessive intake of water or fluid
- WeightLossSudden
Sudden weight loss that has recently occurred
- Fatigue
Fatigue or weakness
- Polyphagia
excessive or extreme hunger
- GenitalThrush
patient has thrush fungus on or near their genital region
- BlurredVision
history of blurred vision
- Itching
skin itching
- Irritability
general irritability and mood issues
- DelayHealing
delayed healing of wounds
- PartialPsoriasis
partial psoriasis on the body
- MuscleStiffness
stiffness of the muscles
- Alopecia
scalp alopecia and hair shedding
- Obesity
Classified as obese
- DiabeticClass
Class label to indicate whether the patient is diabetic or not
Source
Collected by Gary Hutson hutsons-hacks@outlook.com, Dec-2021
Examples
library(dplyr)
data(diabetes_data)
# Convert the DiabeticClass label to a factor
diabetes_data <- diabetes_data %>%
glimpse() %>%
mutate(DiabeticClass = as.factor(DiabeticClass))
is.factor(diabetes_data$DiabeticClass)
Heart disease dataset
Description
The dataset is to be used with a supervised classification ML model to classify heart disease.
Usage
heartdisease
Format
A data frame with 918 rows and 10 variables:
- Age
age of the patient presenting with heart disease
- Sex
gender of the patient
- RestingBP
blood pressure for resting heart beat
- Cholesterol
Cholesterol reading
- FastingBS
blood sample of glucose after a patient fasts https://www.diabetes.co.uk/diabetes_care/fasting-blood-sugar-levels.html
- RestingECG
Resting electrocardiography (ECG) result; an indicator of previous myocardial infarction, e.g. heart attack
- MaxHR
Maximum heart rate
- Angina
chest pain caused by decreased blood flow https://www.nhs.uk/conditions/angina/
- HeartPeakReading
reading at the peak of the heart rate
- HeartDisease
the classification label of whether patient has heart disease or not
Source
Collected by Gary Hutson hutsons-hacks@outlook.com, Dec-2021
Examples
library(dplyr)
library(ConfusionTableR)
data(heartdisease)
# Convert the HeartDisease label to a factor
hd <- heartdisease %>%
glimpse() %>%
mutate(HeartDisease = as.factor(HeartDisease))
# Check that the label is now a factor
is.factor(hd$HeartDisease)
# Dummy encoding
# Get categorical columns
hd_cat <- hd %>%
dplyr::select_if(is.character)
# Dummy encode the categorical variables
# Specify the columns to encode
cols <- c("RestingECG", "Angina", "Sex")
# Dummy encode using dummy_encoder in ConfusionTableR package
coded <- ConfusionTableR::dummy_encoder(hd_cat, cols, remove_original = TRUE)
coded <- coded %>%
select(RestingECG_ST, RestingECG_LVH, Angina=Angina_Y,
Sex=Sex_F)
# Remove column names we have encoded from original data frame
hd_one <- hd[,!names(hd) %in% cols]
# Bind the numerical data on to the categorical data
hd_final <- bind_cols(coded, hd_one)
# Output the final encoded data frame for the ML task
glimpse(hd_final)
Long stayers dataset
Description
Classification dataset of long-staying patients. Contains patients who have been registered as an inpatient with a length of stay longer than 7 days: https://www.england.nhs.uk/south/wp-content/uploads/sites/6/2016/12/rig-reviewing-stranded-patients-hospital.pdf.
Usage
long_stayers
Format
A data frame with 768 rows and 9 variables:
- stranded.label
binary classification label indicating whether stranded = 1 or not stranded=0
- age
age of the patient
- care.home.referral
flag indicating whether referred from a private care home - 1=Care Home Referral and 0=Not a care home referral
- medicallysafe
flag indicating whether they are medically safe for discharge - 1=Medically safe and 0=Not medically safe
- hcop
flag indicating health care for older person triage - 1=Yes triaged from HCOP and 0=Triaged from different department
- mental_health_care
flag indicating whether they require mental health care - 1=MH assistance needed and 0=No history of mental health
- periods_of_previous_care
Count of the number of times they have been in hospital in last 12 months
- admit_date
date the patient was admitted as an inpatient
- frailty_index
indicates the type of frailty - nominal variable
Source
Prepared, acquired and adapted by Gary Hutson hutsons-hacks@outlook.com, Dec-2021. Synthetic data, based on live patient data from various NHS secondary health care trusts.
Examples
library(dplyr)
library(ggplot2)
library(caret)
library(rsample)
library(varhandle)
data("long_stayers")
glimpse(long_stayers)
# Examine class imbalance
prop.table(table(long_stayers$stranded.label))
# Feature engineering
long_stayers <- long_stayers %>%
dplyr::mutate(stranded.label=factor(stranded.label)) %>%
dplyr::select(everything(), -c(admit_date))
# Feature encoding
cats <- select_if(long_stayers, is.character)
cat_dummy <- varhandle::to.dummy(cats$frailty_index, "frail_ind")
#Converts the frailty index column to dummy encoding and sets a column called "frail_ind" prefix
cat_dummy <- cat_dummy %>%
as.data.frame() %>%
dplyr::select(-frail_ind.No_index_item) # Drop one dummy column to act as the reference level
long_stayers <- long_stayers %>%
dplyr::select(-frailty_index) %>%
bind_cols(cat_dummy) %>% na.omit(.)
# Split the data (set the seed first so the split is reproducible)
set.seed(123)
split <- rsample::initial_split(long_stayers, prop = 3/4)
train <- rsample::training(split)
test <- rsample::testing(split)
glm_class_mod <- caret::train(factor(stranded.label) ~ ., data = train,
method = "glm")
print(glm_class_mod)
# Predict the probabilities
preds <- predict(glm_class_mod, newdata = test) # Predict class
pred_prob <- predict(glm_class_mod, newdata = test, type="prob") #Predict probs
predicted <- data.frame(preds, pred_prob)
test <- test %>%
bind_cols(predicted) %>%
dplyr::rename(pred_class=preds)
#Evaluate with ConfusionTableR
library(ConfusionTableR)
cm <- ConfusionTableR::binary_class_cm(test$stranded.label, test$pred_class, positive="Stranded")
cm$record_level_cm
# Visualise odds ratios
library(OddsPlotty)
plotty <- OddsPlotty::odds_plot(glm_class_mod$finalModel,
title = "Odds Plot ",
subtitle = "Showing odds of patient stranded",
point_col = "#00f2ff",
error_bar_colour = "black",
point_size = .5,
error_bar_width = .8,
h_line_color = "red")
print(plotty)
Stroke Classification dataset
Description
This dataset has been obtained from a Stroke department within the NHS and is a traditional supervised ML classification dataset.
Usage
stroke_classification
Format
A data frame with 5110 rows and 11 variables:
- pat_id
unique patient identifier index
- stroke
outcome variable as a flag - 1 for stroke and 0 for no stroke
- gender
patient gender description
- age
age of the patient
- hypertension
binary flag to indicate whether patient has hypertension: https://www.nhs.uk/conditions/high-blood-pressure-hypertension/
- heart_disease
binary flag to indicate whether patient has heart disease: 1 or no heart disease history: 0
- work_related_stress
binary flag to indicate whether patient has history of work related stress
- urban_residence
binary flag indicating whether patient lives in an urban area or not
- avg_glucose_level
average blood glucose readings of the patient
- bmi
body mass index of the patient: https://www.nhs.uk/live-well/healthy-weight/bmi-calculator/
- smokes
binary flag to indicate if the patient smokes - 1 for current smoker and 0 for smoking cessation
Source
Prepared and compiled by Gary Hutson hutsons-hacks@outlook.com, Apr-2022.
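No example accompanies this dataset, so here is a minimal sketch of a typical first step before modelling: converting the `stroke` flag to a factor and checking class balance. It assumes the `stroke` and `pat_id` columns are as documented above; the helper name `prep_stroke` and the level labels `no_stroke`/`stroke` are hypothetical choices, not part of MLDataR.

```r
# Hypothetical helper (not part of MLDataR): prepare the outcome label.
# Assumes `stroke` is coded 1 = stroke, 0 = no stroke, as documented.
prep_stroke <- function(df) {
  # Label the outcome so model output is easier to read
  df$stroke <- factor(df$stroke, levels = c(0, 1),
                      labels = c("no_stroke", "stroke"))
  # The patient identifier carries no predictive signal, so drop it
  df$pat_id <- NULL
  df
}

# Run against the packaged data only if MLDataR is installed
if (requireNamespace("MLDataR", quietly = TRUE)) {
  data("stroke_classification", package = "MLDataR")
  sc <- prep_stroke(stroke_classification)
  # Examine class imbalance in the outcome
  print(prop.table(table(sc$stroke)))
}
```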
Thyroid disease dataset
Description
The dataset is to be used with a supervised classification ML model to classify thyroid disease. The dataset was sourced and adapted from the UCI Machine Learning repository https://archive.ics.uci.edu/ml/index.php.
Usage
thyroid_disease
Format
A data frame with 3772 rows and 28 variables:
- ThryroidClass
binary classification label indicating whether sick = 1 or negative=0
- patient_age
age of the patient
- patient_gender
flag indicating gender of patient - 1=Female and 0=Male
- presc_thyroxine
flag to indicate whether thyroxine replacement prescribed 1=Thyroxine prescribed
- queried_why_on_thyroxine
flag to indicate query has been actioned
- presc_anthyroid_meds
flag to indicate whether anti-thyroid medicine has been prescribed
- sick
flag to indicate sickness due to thyroxine depletion or over activity
- pregnant
flag to indicate whether the patient is pregnant
- thyroid_surgery
flag to indicate whether the patient has had thyroid surgery
- radioactive_iodine_therapyI131
indicates whether patient has had radioactive iodine treatment: https://www.nhs.uk/conditions/thyroid-cancer/treatment/
- query_hypothyroid
flag to indicate under active thyroid query https://www.nhs.uk/conditions/underactive-thyroid-hypothyroidism/
- query_hyperthyroid
flag to indicate over active thyroid query https://www.nhs.uk/conditions/overactive-thyroid-hyperthyroidism/
- lithium
Lithium carbonate administered to decrease the level of thyroid hormones
- goitre
flag to indicate swelling of the thyroid gland https://www.nhs.uk/conditions/goitre/
- tumor
flag to indicate a tumor
- hypopituitarism
flag to indicate a diagnosed underactive pituitary gland
- psych_condition
indicates whether a patient has a psychological condition
- TSH_measured
a TSH level lower than normal indicates there is usually more than enough thyroid hormone in the body and may indicate hyperthyroidism
- TSH_reading
the reading result of the TSH blood test
- T3_measured
linked to TSH reading - when free triiodothyronine rise above normal this indicates hyperthyroidism
- T3_reading
the reading result of the T3 blood test looking for above normal levels of free triiodothyronine
- T4_measured
free thyroxine, also known as T4, is used with T3 and TSH tests to diagnose hyperthyroidism
- T4_reading
the reading result of the T4 test
- thyrox_util_rate_T4U_measured
flag indicating the thyroxine utilisation rate https://pubmed.ncbi.nlm.nih.gov/1685967/
- thyrox_util_rate_T4U_reading
the result of the test
- FTI_measured
flag to indicate measurement on the Free Thyroxine Index (FTI) https://endocrinology.testcatalog.org/show/FRTUP
- FTI_reading
the result of the test mentioned above
- ref_src
[nominal] indicating the referral source of the patient
Source
Prepared and adapted by Gary Hutson hutsons-hacks@outlook.com, Dec-2021 and sourced from Garavan Institute and J. Ross Quinlan.
References
Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan.
Examples
library(dplyr)
library(ConfusionTableR)
library(parsnip)
library(rsample)
library(recipes)
library(ranger)
library(workflows)
data("thyroid_disease")
td <- thyroid_disease
# Create a factor of the class label to use in ML model
td$ThryroidClass <- as.factor(td$ThryroidClass)
# Check the structure of the data to make sure factor has been created
str(td)
# Remove missing values, or choose a more advanced imputation option
td <- td[complete.cases(td),]
#Drop the column for referral source
td <- td %>%
dplyr::select(-ref_src)
# Analyse class imbalance
class_imbalance <- prop.table(table(td$ThryroidClass))
class_imbalance
#Divide the data into a training test split
set.seed(123)
split <- rsample::initial_split(td, prop=3/4)
train_data <- rsample::training(split)
test_data <- rsample::testing(split)
# Create recipe to normalise the predictors and remove zero-variance columns
set.seed(123)
td_recipe <-
recipe(ThryroidClass ~ ., data=train_data) %>%
step_normalize(all_predictors()) %>%
step_zv(all_predictors())
# Instantiate the model
set.seed(123)
rf_mod <-
parsnip::rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification")
# Create the model workflow
td_wf <-
workflow() %>%
workflows::add_model(rf_mod) %>%
workflows::add_recipe(td_recipe)
# Fit the workflow to our training data
set.seed(123)
td_rf_fit <-
td_wf %>%
fit(data = train_data)
# Extract the fitted parsnip model
td_fitted <- td_rf_fit %>%
extract_fit_parsnip()
# Predict on the test set to see model performance
class_pred <- predict(td_rf_fit, test_data)
td_preds <- test_data %>%
bind_cols(class_pred)
# Convert both to factors
td_preds$.pred_class <- as.factor(td_preds$.pred_class)
td_preds$ThryroidClass <- as.factor(td_preds$ThryroidClass)
# Evaluate the data with ConfusionTableR
cm <- ConfusionTableR::binary_class_cm(td_preds$ThryroidClass ,
td_preds$.pred_class,
positive="sick")
#View Confusion matrix
cm$confusion_matrix
#View record level
cm$record_level_cm