---
title: "3. Models: Proportion Inference"
author: "David Gerbing"
output:
  rmarkdown::html_vignette:
    toc: true

vignette: >
  %\VignetteIndexEntry{3. Models: Proportion Inference}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r include=FALSE}
suppressPackageStartupMessages(library("lessR"))
```


The analysis of proportions is of two primary types.

- Focus on a single value of a categorical variable, termed a "success" when it occurs, for one or more samples of data. Analyze the resulting proportion of occurrence for a single sample, or for a test of _homogeneity_, compare proportions of successes across distinct data samples for a single variable.
- Compare the obtained proportions across the values of one or more categorical variables for a single sample. Applied to a single variable, the analysis is a _goodness-of-fit_. Or, evaluate a potential relationship between two categorical variables, a test of _independence_.

From standard base R functions, the __lessR__ function `Prop_test()`, abbreviated `prop()`, provides either type of analysis. To use, generally enter either the original data from which to compute the frequencies and then the sample proportions, or enter already computed frequencies. For the analysis of multiple categorical variables across two levels of one of the variables, the test of _homogeneity_ and the test of _independence_ yield the identical statistical result.

The following table summarizes the values of the `Prop_test()` parameters for different analyses of proportions. Each function call for the analysis of data begins with the name of a categorical variable, generically referred to as `X`. The value of `X` is the first parameter in the function definition, and so does not need its parameter name, `variable`. If needed, indicate a second categorical variable, generically referred to as `Y`, with the `by` parameter. If focused on a specific value of `X` as a success, referred to as `X_value`, indicate that value with the `success` parameter.

Run each analysis either directly from pre-computed values of the sample proportions, or from the original data from which the sample proportions are calculated.

Evaluate              |  Data Parameters    |  Count Parameters
----------------- | ------------ | -------------
A hypothesized proportion | X, `success`=X_value | `n_succ`, `n_tot` [scalars]
Equal proportions across samples | X, `success`=X_value, `by`=Y | `n_succ`, `n_tot` [vectors]
Uniform goodness-of-fit | X | `ntot` [vector]
Independence of two variables | X, `by`=Y | `n_table`

The remainder of this vignette illustrates these applications of `Prop_test()`.


## Test a Specified Proportion

Define the occurrence of a designated value of the `variable` as a `success`. Define all other values of the variable as failures. Of course, success or failure in this context does not necessarily mean good or bad, desired or undesired, but instead, a designated value either occurred or did not.

When analyzing proportions from data, first indicate the categorical variable, the value of the parameter `variable`. Next, indicate the designated value of `variable` with the parameter `success`. When entering proportions directly, indicate the number of successes and the total number of trials with the `n_succ` and `n_tot` parameters. Enter the value of each parameter either as a single value for one sample or as a vector of multiple values for multiple samples. Without a value for `success` or `n_succ` the analysis is of goodness-of-fit or independence.


### Single Proportion

The example below is from the documentation for the base R function `binom.test()`, which provides the exact test of a null hypothesis regarding the probability of success. `Prop_test()` uses that base R function to compare a sample proportion to a hypothesized population value.

#### From Input Frequencies

For a given categorical variable of interest, a type of plant, consider two values, either "giant" or "dwarf". From a sample of 925 plants, the specified value of "giant" occurred 682 times and did not occur 243 times. The null hypothesis tested is that the specified value occurs for 3/4 of the population according to the `pi` parameter.

```{r}
Prop_test(n_succ=682, n_fail=243, pi=.75)
```


#### From Data

To illustrate with data, read the _Jackets_ data file included with __lessR__ into the data frame _d_. The file contains two categorical variables. The variable _Bike_ represents two different types of motorcycle: BMW and Honda. The second variable is _Jacket_ with three values of jacket thickness: Lite, Med, and Thick.  Because _d_ is the default name of the data frame that contains the variables for analysis, the `data` parameter that names the input data frame need not be specified.

```{r read}
d <- Read("Jackets")
```

In following example, for the `variable` _Bike_ from the default _d_ data frame, define the parameter `success` as the value _"BMW"_. The default null hypothesis is a population value of 0.5, but here explicitly specify with the parameter `pi`.

For clarity, the following example includes the parameter names listed with their corresponding values. These names are unnecessary in this example, however, because the values are listed in the same order of their definition of the `Prop_test()` function.

```{r}
Prop_test(variable=Bike, success="BMW", pi=0.5)
```

Reject the null hypothesis, with a $p$-value of 0.000, less than $\alpha = 0.05$. The sample result of the sample proportion $p=0.408$ is considered far from the hypothesized value of $0.5$ for the proportion of `"BMW"` values for _Bike_. Conclude that the data were sampled from a population with a population proportion of BMW different from 0.5.


### Multiple Proportions

The following example is from the base R `prop.test()` documentation, which the __lessR__ `Prop_test()` relies upon to compare proportions across different groups.


#### From Input Frequencies

The null hypothesis in this example is that the four populations of _patients_ from which the samples were drawn have the same population proportion of _smokers_. The alternative is that at least one population proportion is different. Label the groups in the output by providing a named vector for the successes.

To indicate multiple proportions across groups, provide multiple values for the `n_succ` and `n_tot` parameters. Optionally, name the groups.

```{r}
smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
```

The result of the test is $p$-value $=0.006 < \alpha=0.05$, so reject the null hypothesis of equal probabilities across the corresponding four populations. Conclude that at least one of the population proportions of smokers differ.


#### From Data

In the following example, duplicate the previous results, but in this example from data. To illustrate, create the data frame _d_ according to the proportions of smokers and non-smokers with respective values "smoke" and "nosmoke". Of course, in actual data analysis the data would already be available.

```{r}
sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)
```

To test if the different groups have the same population proportion of `success`, retain the syntax for a single proportion for the categorical `variable` of interest. Define success by the value of this variable, here _"smoke"_. However, an additional parameter `by` indicates the variable that defines the groups, a variable that contains a label that identifies the corresponding group for each row of data. The grouping variable in this example is _grp_, with values the first four uppercase letters of the alphabet. The first five rows of data are shown below.

```{r}
head(d)
```


The relevant parameters `variable`, `success`, and `by` are listed in their given order in this example, so the parameter names are unnecessary. List the names for clarity.

```{r}
Prop_test(variable=sm, success="smoke", by=grp)
```

The analysis of data that matches the previously input proportions, of course, provides the same results as providing the proportions directly.


## Tests without a Specified Proportion

### Goodness-of-Fit

For the previously discussed test of homogeneity of the values of a single categorical variable, the proportion of occurrences for a specific value across different samples is of interest. Here, instead calculate the proportion of occurrence for each value from the total number of occurrences, as one sample from a single population. In addition to the inference test, the following are also reported:
- The observed and expected frequencies
- The residual of expected from observed
- The standardized version of the residual


#### From Input Frequencies

For the goodness-of-fit test to a uniform distribution, provide the frequencies for each group for the parameter `n_tot`. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.

In this example, enter three frequencies as a vector for the `n_tot` parameter value. Optionally, make the vector a named vector to label the output accordingly.

```{r}
x = c(372, 342, 311)
names(x) = c("Lite", "Med", "Thick")
Prop_test(n_tot=x)
```

This example does not quite attain significance at the customary 5\% level, with $p$-value $= 0.066 > \alpha = 0.05$. A difference of the corresponding population proportions was not detected.


#### From Data

The same analysis follows from the data. Just specify the name of the categorical `variable` of interest.

```{r}
d <- Read("Jackets", quiet=TRUE)
Prop_test(Jacket)
```


### Independence

Tests of independence evaluated here rely upon a contingency table of two dimensions also called a cross-tabulation table or joint frequency table. Enter the joint frequencies directly or compute from the data. The corresponding analysis provides the chi-square test for the null hypothesis of independence.

Also provided is Cramer's V to indicate the extent of the relationship of the two categorical variables. For each cell frequency, the expected value given the independence assumption is provided, along with the corresponding residual from the observed frequency and the corresponding standardized residual.


#### From Input Frequencies

To enter the joint frequency table directly, store the frequencies in a file accessible from your computer system. One possibility is to enter the numbers into a text file with file type `.csv` or `.txt`. Enter the numbers with a text editor, or with a word processor saving the file as a text file. This file format separates the adjacent values in each row with a comma, as indicated below. Or, enter the numbers into an MS Excel formatted file with file type `.xlsx`. Enter only the numeric frequencies, no labels.

For example, consider the following joint frequency table with four levels of the column variable and four levels of the row variable, here in `csv` format.

```
3,58,6,105
41,79,9,207
86,179,27,484
143,214,31,824
```

After saving the file, call `Prop_test()` using the parameter `n_table` to indicate the path name to the file, enclosed in quotes. Or, leave the quotes empty to browse for the joint frequency table.

This table is included in a file downloaded with __lessR__ with the name _FreqTable99_. That name triggers an internal process that locates the file within the _lessR_ installation without needing to construct a rather complicated path name as part of this example. That also means that the name becomes a reserved key word with its use always triggering the following example.

In general, replace _FreqTable99_ in this example with your own path name to your file of joint frequencies, or just delete the name leaving only the two quotes to indicate to browse for the file.

```{r}
Prop_test(n_table="FreqTable99")
```

Do not have the path name to your file readily available? Then browse for the file. The following example is not run as it cannot run in this vignette.

```
Prop_test(n_table="")
```

The full path name for the file is provided as part of the output.


#### From Data

The $\chi^2$ test of independence evaluated here applies to two categorical variables. The first categorical variable listed in this example is the value of the parameter `variable`, the first parameter in the function definition, so does not need the parameter name. The second categorical variable listed must include the parameter name `by`.

The question for the analysis is if the observed frequencies of _Jacket_ thickness and _Bike_ ownership sufficiently differ from the frequencies expected by the null hypothesis that we conclude the variables are related.

```{r}
Prop_test(Jacket, by=Bike)
```

The result of this test is that the $p$-value = 0.000 $< \alpha=0.05$, so reject the null hypothesis of independence. Conclude that the type of _Bike_ a person rides and the thickness of their _Jacket_ are related.

To visualize the relationship of the two variables, use the same function call syntax, but now to `Chart()` instead of `Prop_test()`. The visualization is accompanied by the same $\chi^2$ test of independence.


```{r, fig.width=5}
Chart(Jacket, by=Bike)
```

The visualization depicts the relationship between motorcycle and jacket: Honda riders prefer thinner jackets, and BMW riders prefer thicker jackets. To speculate, perhaps because the BMW bikes are sportier, their riders are more concerned with going down on the pavement.

This relationship becomes even clearer to visualize with the corresponding 100% stack bar graph. Each bar representing a jacket choice in this visualization shows the percentage of riders with each type of motorcycle for that jacket.


```{r, fig.width=5}
BarChart(Jacket, by=Bike, stack100=TRUE)
```

From this visualization we see that 24\% of Lite jacket owners are BMW riders, and, in contrast, 62% of the owners of Heavy jackets are  BMW riders.