--- title: "3. Models: Mean Inference, t-test and ANOVA" author: "David Gerbing" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{3. Models: Mean Inference, t-test and ANOVA} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r include=FALSE} suppressPackageStartupMessages(library("lessR")) ``` ```{r, include=FALSE} knitr::opts_chunk$set(fig.width=3.5, fig.height=3) ``` First read the Employee data included as part of __lessR__. ```{r read} d <- Read("Employee") ``` As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be a `csv` file or an Excel file. Currently, read the label file into the _l_ data frame. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label. ```{r labels} l <- rd("Employee_lbl") ``` ## One-Sample t-test Obtain the summary statistics and 95% confidence interval for a single variable by specifying that variable with `ttest()`. Because _d_ is the default name of the data frame that contains the variables for analysis, the `data` parameter that names the input data frame need not be specified. ```{r} ttest(Salary) ``` Add a hypothesis test to the above. ```{r, fig.height=3.75, fig.width=5} ttest(Salary, mu=52000) ``` Analysis of the above from summary statistics only. ```{r} ttest(n=37, m=73795.557, s=21799.533, Ynm="Salary", mu=52000) ``` ## Two-Samples t-test ### Independent Groups Full analysis with the `ttest()` function, abbreviated as `tt()`, with formula mode. Specify the response variable to the left of the tilde, `~`, and the categorical variable with two groups, the grouping variable, to the right of the tilde. ```{r, fig.height=4.25, fig.width=5} ttest(Salary ~ Gender) ``` Brief version of the output contains just the basics. ```{r, fig.height=4.25, fig.width=5} tt_brief(Salary ~ Gender) ``` ### Dependent Groups ```{r, fig.height=4.25, fig.width=5} tt_brief(Pre, Post) ``` ## ANOVA Analysis of variance applies to the inferential analysis of means across groups. The __lessR__ function `ANOVA()`, abbreviated `av()`, provides this analysis, based on the base R function `aov()`. The data for these examples is the _warpbreaks_ data set included with the R __datasets__ package. The data are from a weaving device called a loom for a fixed length of yarn. The response variable is the number of times the yarn broke during the weaving. Independent variables are the type of wool -- A or B --and the level of tension -- L, M, or H. Because _warpbreaks_ is not the default data frame, specify with the `data` parameter (or set _d_ equal to _warpbreaks_). ### One-way Independent Groups First, for illustrative purposes, ignore the type of wool and only examine the impact of tension on breaks. The output includes descriptive statistics, ANOVA table, effect size indices, Tukey's multiple comparisons of means, and residuals, as well as the scatterplot of the response variable with the levels of the independent variable, and a visualization of the mean comparisons. ```{r fig.width=5} ANOVA(breaks ~ tension, data=warpbreaks) ``` The brief version forgoes the multiple comparisons and the residuals. ```{r fig.width=5} av_brief(breaks ~ tension, data=warpbreaks) ``` ### Two-way Independent Groups Specify the second independent variable preceded by a `*` sign. The plot of the cell means is generated automatically. ```{r fig.width=5} ANOVA(breaks ~ tension * wool, data=warpbreaks) ``` Can also obtain the cell mean plot directly from the means. Here use __lessR__ `pivot()` to compute the cell means of _breaks_ across _tension_ and _wool_. ```{r} data(warpbreaks) dm <- pivot(warpbreaks, mean, breaks, c(tension, wool)) dm ``` Store the aggregated data in the data frame named _dm_, so explicitly identify with the `data` parameter. The computed mean of _breaks_ variable in the _dm_ data frame from the previous call to `pivot()` is named _mean_ by default. ```{r fig.width=5} Plot(tension, breaks_mean, by=wool, segments=TRUE, size=2, data=dm, main="Cell Means") ``` ### Randomized Block Design The randomized block design has a treatment variable, usually administered over time, and a blocking variable. The values of the treatment variable are measured across each instance of the blocking variable. In this example, repetitions are measured across four different workout sessions. The person takes one of four supplements before each session. Person is the blocking variable, and Supplement is the treatment variable. Repetitions is the response variable. The data are presented in a wide-form data table, a single row for each person. ```{r} d <- read.csv(header=TRUE, text=" Person,sup1,sup2,sup3,sup4 p1,2,4,4,3 p2,2,5,4,6 p3,8,6,7,9 p4,4,3,5,7 p5,2,1,2,3 p6,5,5,6,8 p7,2,3,2,4") ``` The ANOVA, however, requires data to be in long-form. Reshape data from wide form to long form with base R `reshape()` according to the following parameters. With each parameter, either _identify_ existing variables in the given wide-form data, or _name_ newly created variables in the long-form. This R function refers to a time-variable, which in the context of ANOVA is the treatment variable, of which the values occur over time: first treatment, second treatment, etc. The reshaping from a wide-form to a long-form data table creates two new variables: the variable whose values are collected over time, here the blocking variable, Supplement, and the response variable, here Reps. - `idvar`: Identify the existing blocking (within) variable in the wide-form data - `varying`: Identify the wide-form variables, which occur over time, to be gathered into a single variable in long-format - `timevar`: Name the corresponding treatment (factor) variable in the created long-form - `v.names`: Name the response variable in the created long-form There are many ways to identify the names of the wide-form variables to be gathered into a single time-oriented long-form variable. The most general is to specify a vector of the names, here ``` c("sup1", "sup2" "sup3", "sup4") ``` In this example use the __lessR__ `to` function to create that vector without needed to individually list each variable. ```{r} to("sup", 4) ``` ```{r} d <- reshape(d, direction="long", idvar="Person", varying=list(to("sup", 4)), timevar="Supplement", v.names="Reps") ``` Do not need the row names, so remove before displaying new long-form data. ```{r} row.names(d) <- NULL d[1:10,] ``` To run the ANOVA, specify the blocking variable preceded by a `+` sign. ```{r fig.width=6} ANOVA(Reps ~ Supplement + Person) ``` ## Full Manual Use the base R `help()` function to view the full manual for `ttest()` or `ANOVA()`. Simply enter a question mark followed by the name of the function. ``` ?ttest ?ANOVA ```