---
title: "TidyConsultant"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{TidyConsultant}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r include=FALSE}
library(Ckmeans.1d.dp)
```


```{r setup}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(TidyConsultant)
```

In this vignette we demonstrate how the various packages in the TidyConsultant universe work together in a data science workflow by analyzing insurance data set, which consists of a population of people described by their characteristics and their insurance charge. 

```{r}
data(insurance)
```
## Analyze data properties with validata

```{r}
insurance %>% 
  diagnose()
```


```{r}
insurance %>% 
  diagnose_numeric()
```

```{r}
insurance %>% 
  diagnose_category(everything(),  max_distinct = 7) %>% 
  print(width = Inf)
```

We can infer that each row represents a person, uniquely identified by 5 characteristics. The charges column is also unique, but simply because it is a high-precision double.

```{r}
insurance %>% 
  determine_distinct(everything())
```

## Analyze data relationships with autostats

```{r}
insurance %>% 
  auto_cor(sparse = TRUE) -> cors

cors
```

Analyze the relationship between categorical and continuous variables
```{r}
insurance %>% 
  auto_anova(everything(), baseline = "first_level") -> anovas1

anovas1 %>% 
  print(n = 50)
```

Use the sparse option to show only the most significant effects

```{r}

insurance %>% 
  auto_anova(everything(), baseline = "first_level", sparse = T, pval_thresh = .1) -> anovas2

anovas2 %>% 
  print(n = 50)

```

From this we can conclude that smokers and males incur higher insurance charges. It may be beneficial to explore some interaction effects, considering that BMI's effect on charges will be sex-dependent. 

### create dummies out of categorical variables

```{r}
insurance %>% 
  create_dummies(remove_most_frequent_dummy = T) -> insurance1
```


### explore model-based variable contributions

```{r}
insurance1 %>% 
  tidy_formula(target = charges) -> charges_form

charges_form
```


```{r message=FALSE, warning=FALSE, eval=FALSE}
insurance1 %>% 
  auto_variable_contributions(formula = charges_form)
```

```{r message=FALSE, warning=FALSE, eval=FALSE}
insurance1 %>%
  auto_model_accuracy(formula = charges_form, include_linear = T)
```

### create bins with tidybins

```{r}
insurance1 %>% 
  bin_cols(charges) -> insurance_bins

insurance_bins
```

Here we can see a summary of the binned cols. The with quintile "value" bins we can see that the top 20% of charges comes from the top x people. 

```{r}
insurance_bins %>% 
  bin_summary()
```

## Create and analyze xgboost model for binary classification

Let's see if we can predict whether a customer is a smoker.

Xgboost considers the first level of a factor to be the "positive event" so 
let's ensure that "yes" is the first level using `framecleaner::set_fct`

```{r}
insurance %>% 
  set_fct(smoker, first_level = "yes") -> insurance

insurance %>% 
  create_dummies(where(is.character), remove_first_dummy = T) -> insurance_dummies

insurance_dummies %>% 
  diagnose

```

And create a new formula for the binary classification.

```{r}
insurance_dummies %>% 
  tidy_formula(target = smoker) -> smoker_form

smoker_form
```

Now we can create a basic model using `tidy_xgboost`. Built in heuristic will automatically recognize this as a binary classification task. We can tweak some parameters to add some regularization and increase the number of trees. Xgboost will automatically output feature importance on the training set, and a measure of accuracy tested on a validation set. 

```{r}
insurance_dummies %>% 
  tidy_xgboost(formula = smoker_form, 
              mtry = .5,
              trees = 100L,
              loss_reduction = 1,
              alpha = .1,
              sample_size = .7) -> smoker_xgb_classif


```


Obtain predictions

```{r}
smoker_xgb_classif %>% 
  tidy_predict(newdata = insurance_dummies, form = smoker_form) -> insurance_fit

```


Analyze the probabilities using tidybins. We can find the top 20% of customers most likely to be smokers. 

```{r}
names(insurance_fit)[length(names(insurance_fit)) - 1] -> prob_preds

insurance_fit %>% 
  bin_cols(prob_preds, n_bins = 5) -> insurance_fit1

insurance_fit1 %>% 
  bin_summary()
```


Evaluate the training error. `eval_preds` uses both the probability estimates and binary estimates to calculate a variety of metrics. 

```{r}
insurance_fit1 %>% 
  eval_preds()
```
Traditional `yardstick` confusion matrix can be created manually. 

```{r}
names(insurance_fit)[length(names(insurance_fit))] -> class_preds

insurance_fit1 %>% 
  yardstick::conf_mat(truth = smoker, estimate = class_preds) -> conf_mat_sm

conf_mat_sm
```