--- title: "simulate-dataset" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{simulate-dataset} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(delaydiscount) library(dplyr) ``` The `simulate_dataset` function can be used to simulate data from the hierarchical linearized hyperbolic model. The hierarchical linearized hyperboloid model works as follows. The hyperbolic model for the discounting curve is $D(t) = \frac{1}{1+kt}$. However, actual discounting rates reported by subjects may not fit this curve. Actual discounting proportions are assumed to be between 0 and 1, exclusive. They are transformed from the $(0,1)$ scale to a $(-\infty, \infty)$ scale by the function $\phi(x) = \log(\frac{1}{x}-1)$. The model discounting curve with the transformation applied is $\log(\frac{1}{D(t)} - 1) = \log(k) + \log(t)$. The transformed curve provides a mean structure for the data, conditional on the subject's $\log(k)$ parameter. We assume that the transformed observed indifference point $\tilde D(t_{ijc})$ will follow the model $\log(\frac{1}{\tilde D(t_{ijc})} - 1) = \log(k_{ic}) + \log(t_{ijc}) + \epsilon_{ijc},$ where $\epsilon_{ijc} \sim N(0, \sigma^2)$, $i$ is an index for subject, $j$ an index for time point, and $c$ is an index for group/condition. $\log(k_{ic}) \sim N(\theta_c, \frac{g\sigma^2}{T})$, where $\theta_c$ is the population mean of the $\log(k_{ic})$ for subjects in group $c$, and $T$ is the number of time points observed for each subject. The model is hierarchical, as the $\log(k_{ic})$ parameter for each subject within a group is assumed to be a realization from a normal distribution with some population mean $\theta_c$ for its subjects, and variance $\frac{g\sigma^2}{T}$, i.e. the variance is constant across subjects, across groups. A call to `simulate_dataset` is shown below. ```{r} set.seed(8678) sim_data <- simulate_dataset(groups = c("EFT", "HIT", "NCC"), num_subj = c(120, 142, 172), time_points = c(30, 90, 180, 365, 1095, 1825, 3650), mean_ln_k = c(-6.877, -5.861, -6.078), sigma_sq = 1.978, g = 10.407) ``` The `groups` vector gives the name of the groups in the simulated study. The `num_subj` vector gives the sample size for each group. The `time_points` vector gives the surveyed delay lengths. The `mean_ln_k` vector gives the group mean hyperparameters. The value `sigma_sq` gives the parameter $\sigma^2$, which represents the conditional variance of an observed transformed indifference point $\log(\frac{1}{\tilde D(t_{ijc})} - 1)$ given the subject's $\log(k_{ic})$ parameter. The value `g` gives the ratio of the variance subject $\log(k_{ic})$ parameter to the conditional variance of the least squares (on the transformed scale) estimator of a subject's $\log(k_{ic})$ parameter given the true $\log(k_{ic})$. The result is that the true $\log(k_{ic})$ parameters have variance $\frac{g\sigma^2}{T}$. (Recall $T$ is the length of the `time_points` vector.) The data frame output by `sim_data` can be analyzed using our analysis methods. In particular, `prepare_data_frame` and `jb_rule_check` are ready to be run on the data frame. However, `dd_hyperbolic_model` and `get_subj_est_ln_k` will still need to be run on the output from `prepare_data_frame`. ```{r} # These methods can be run on the output from simulate_dataset prep_data <- prepare_data_frame(sim_data) rule_check_results <- jb_rule_check(sim_data) # These methods need to be run on the output from prepare_data_frame subj_est_ln_k <- get_subj_est_ln_k(prep_data) sim_model <- dd_hyperbolic_model(prep_data) ``` The output from `simulate_dataset` has one observation per simulated subject per delay for which an indifference point was observed. It has columns `subj`, `true_ln_k`, `group`, `delay`, and `indiff`. The columns other than `true_ln_k` are necessary to run the analysis function on the dataset. The `subj` column identifies the subject, the `group` column identifies the group, the `delay` column identifies the length of time for which the observed indifference point is measured, and the `indiff` column contains the subject's indifference point for the delay expressed as a proportion of the maximum reward. The `true_ln_k` column contains the true $\log(k_{ic})$ parameter of the subject which the observation corresponds to. This can be used to compare estimates of the subject $\log(k_{ic})$ parameters with the true values. For example, the following code produces a plot of the true vs estimated subject $\log(k_{ic})$ parameters, with points colored by rule check result. ```{r} #| fig.alt: > #| Scatterplot showing the true vs estimated log k values for each subject, #| colored by whether or not the subject passed the rule check. The subject #| log k parameter estimates are maximum likelihood estimates. #| The plot shows a line indicating where the estimated log k equals the true #| log k, and the plotted points scatter around that line, since the estimator #| is unbiased. The true log k values of the rule check failures tend to vary #| more than the true log k values of the rule check passers. # Get a data frame consisting of one observation per subject, containing each # subject's true ln(k) subj_true_ln_k <- sim_data %>% select(subj, true_ln_k, group) %>% unique() # Merge the data frames containing the true subject ln_k values, estimated # subject ln_k values, and rule check results. subj_ln_k_comp <- merge(subj_est_ln_k, subj_true_ln_k) subj_ln_k_comp_wrc <- merge(subj_ln_k_comp, rule_check_results) # Get the overall rule check result for each subject subj_ln_k_comp_wrc$rc_pass <- subj_ln_k_comp_wrc$C1 & subj_ln_k_comp_wrc$C2 # Make a plot showing true vs estimated ln(k) for each subject, # coloring the points based on whether or not the corresponding subject # passed the rule check. plot(x = subj_ln_k_comp_wrc$true_ln_k, y = subj_ln_k_comp_wrc$ln_k, col = ifelse(subj_ln_k_comp_wrc$rc_pass, "green", "blue"), main = "Subject True vs Maximum Likelihood Estimated ln(k)", xlab = "True subject ln(k)", ylab = "Estimated subject ln(k)") abline(a=0, b=1) legend("topleft", legend = c( "Passes rule check", "Fails rule check"), col = c("green", "blue"), pch = 1) ```