Help for package Rita

Type:

Package

Title:

Automated Transformations, Normality Testing, and Reporting

Version:

1.2.0

Description:

Automated performance of common transformations used to fulfill parametric assumptions of normality and identification of the best performing method for the user. Output for various normality tests (Thode, 2002) corresponding to the best performing method and a descriptive statistical report of the input data in its original units (5-number summary and mathematical moments) are also presented. Lastly, the Rankit, an empirical normal quantile transformation (ENQT) (Soloman & Sawilowsky, 2009), is provided to accommodate non-standard use cases and facilitate adoption. <doi:10.1201/9780203910894>. <doi:10.22237/jmasm/1257034080>.

License:

MIT + file LICENSE

Encoding:

UTF-8

Imports:

base, stats, lattice

RoxygenNote:

7.1.2

NeedsCompilation:

Packaged:

2022-03-15 14:32:35 UTC; Daniel

Author:

Daniel Mattei [aut, cre], John Ruscio [aut]

Maintainer:

Daniel Mattei <DMattei@live.com>

Repository:

CRAN

Date/Publication:

2022-03-15 14:50:07 UTC

Anderson-Darling Test

Description

This function computes the one-sample Anderson-Darling test statistic and p-value for fit to a normal distribution.

Usage

ADTest(data, alpha = 0.05, j = 1)

Arguments

data

Data of a univariate distribution for which the test statistic is computed (vector)

alpha

The two-sided decision threshold used for hypothesis-testing

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

Details

An adjusted statistic provided by D'agostino & Stephens (1986) is used, where the mean and variance of the population are treated as unknown. D'agostino & Stephen's (1986) text provides the equations used to obtain the function's p-values.

Value

An object including the test statistic, p-value, and a significance flag (list)

References

D'agostino, R. B., & Stephens, M. A. (1986). Goodness-of-fit-techniques (Vol. 68). CRC press.

Examples

values <- rnorm(100)
x <- ADTest(data = values)

D'agostino Pearson Omnibus Test

Description

This function computes the D'agostino Pearson omnibus test using adjusted Fisher- Pearson skewness and kurtosis estimators.

Usage

DPTest(data, alpha = 0.05, j = 1, warn = T)

Arguments

data

Data of a univariate distribution for which the test statistic is computed (vector)

alpha

The two-sided decision threshold used for hypothesis-testing

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

warn

Used for printing a warning message when testing is terminated for N < 8 (boolean)

Value

An object including the test statistic, p-value, and a significance flag (list)

References

D'agostino, R. B., & Stephens, M. A. (1986). Goodness-of-fit-techniques (Vol. 68). CRC press.

D’agostino, R. B., & Belanger, A. (1990). A Suggestion for Using Powerful and Informative Tests of Normality. The American Statistician, 44(4), 316–321. https://doi.org/10.2307/2684359

Shreve, Joni N. and Donna Dea Holland . 2018. SAS® Certification Prep Guide: Statistical Business Analysis Using SAS®9. Cary, NC: SAS Institute Inc.

Examples

values <- rnorm(100)
x <- DPTest(data = values)

Jarque-Bera Test

Description

This function performs the Jarque-Bera test for normality using adjusted Fisher- Pearson skewness and kurtosis coefficients.

Usage

JBTest(data, alpha = 0.05, j = 1, N_Sample = 10000, warn = T)

Arguments

data

Data of a univariate distribution for which the test statistic is computed (vector)

alpha

The two-sided decision threshold used for hypothesis-testing

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

N_Sample

The # samples used to generate the bootstrapped sampling distribution, in cases when N < 2000 (scalar)

warn

Used for printing a warning message when boostrapping is performed for sample-sizes < 2000 or when testing is terminated for N < 4 (boolean)

Details

Large samples (N >= 2000) use p-values obtained with reference to the chi-square distribution, whereas smaller samples output p-values obtained via bootstrapping. When N < 4, testing is terminated.

Value

An object including the test statistic, p-value, and a significance flag (list)

References

Jarque, C. M. and Bera, A. K. (1980). Efficient test for normality, homoscedasticity and serial independence of residuals. Economic Letters, 6(3), pp. 255-259.

Shreve, Joni N. and Donna Dea Holland . 2018. SAS® Certification Prep Guide: Statistical Business Analysis Using SAS®9. Cary, NC: SAS Institute Inc.

Examples

values <- rnorm(100)
x <- JBTest(data = values)

Kolmogorov-Smirnov-Lilliefors Test

Description

This function computes the Lilliefors variant of the one-sample Kolmogorov-Smirnov test.

Usage

KSLTest(data, alpha = 0.05, j = 1, warn = T)

Arguments

data

The data of a univariate distribution for which the test statistic is computed (vector)

alpha

The two-sided decision threshold used for hypothesis-testing (scalar)

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

warn

Used for printing a warning message when negative values are imputed to 0.0 (boolean)

Details

Molin & Abdi's (1998) algorithmic approximation of p-values is used for hypothesis-testing. Note that this algorithm requires the imputation of 0.0 for negative output when p-values would otherwise be low in value (< 0.001) using other methods. A similar issue with extremely large values requires the imputation of 1.0 for values larger than 1.0 when p > .99.

Value

An object including the test statistic, p-value, and a significance flag (list)

References

Lilliefors, H.W. (1967). On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown. Journal of the American Statistical Association, 62, 399-402.

Molin, P., & Abdi, H. (1998). New Tables and numerical approximation for the KolmogorovSmirnov/Lillierfors/Van Soest test of normality.

Examples

values <- rnorm(100)
x <- KSLTest(data = values)

Master Normality Testing Function

Description

This is a master function to call the appropriate test(s) to be used in the 'Rita' function.

Usage

MasterTest(c, data, alpha = 0.05, j = 1)

Arguments

c

Input specifying the test to run (scalar)

data

The data of a univariate distribution for which the test statistic is computed (vector)

alpha

The two-sided decision threshold used for hypothesis-testing (scalar)

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

Value

An results object specific to the test designated with the 'c' argument (list)

Examples

values <- rnorm(100)
x <- MasterTest(c = 1, data = values)

Master Transformation Function

Description

This is a master function used to perform the appropriate transformation(s) within the 'Rita' function.

Usage

MasterXform(c, data)

Arguments

c

Input specifying the test to run (scalar)

data

The data of a univariate distribution for which the test statistic is computed (vector)

Value

Output from the appropriate subfunction (list)

Examples

values <- rnorm(100)
x <- MasterXform(c = 2, data = values)

Rita

Description

R Exploratory Data Analysis (REDA; pronounced "rita") summarizes an input dataset by the M, SD + 5-number summary + third and fourth moments and visualizes the data according to an algorithm or as specified by the user. In addition, Rita will provide the results of one or several normality tests. Lastly, Rita normalizes the dataset with several methods and provides visualizations of the best performing method to the user.

Usage

Rita(
  data,
  test = 1,
  xform = 1,
  alpha = 0.05,
  j = 1,
  autoPlot = T,
  histPlot = F,
  densPlot = F,
  stripPlot = F,
  violinPlot = F,
  xformPlot = F,
  return = T,
  seed = 10
)

Arguments

data

Input dataset (matrix, dataframe, or vector). For a univariate distribution, submit a vector or a subsetted matrix or dataframe. If results for many univariate distributions are desired, submit a matrix or dataframe with each column representing a given variable if all distributions are of the same sample-size. If not, it is recommended to call Rita repeatedly for each variable.

test

Desired normality test (scalar). By default (test = 1), Rita will present the results of the Shapiro-wilk test to the user.

test = 1: Shapiro-Wilk (SW)

test = 2: Kolmogorov-Smirnov/Lilliefors (KSL)

test = 3: Anderson-Darling (AD)

test = 4: Jarque-Bera (JB)

test = 5: D'Agostino Pearson Omnibus (DP)

test = 6: Chi-square test (chiSq)

test = 7: Results of all tests for the best performing transformation

The order of the tests printed corresponds to the order of the variables stored within the input dataset.

xform

Desired normalization method (scalar). By default (xform = 1), Rita will assess which method performs best and (a.) return the transformed data to the user, and (b.) visualize the data according to the settings of the plot argument.

Please note that, per the recommendations of Osborne (2002), a constant is added prior to logarithmic and inverse transformations to ensure that the minimum value is anchored at 1, and prior to the square-root transformation to ensure a left anchor of 0.

Similarly, the arc-sine and logit transformations are applied after converting the units, if needed, to ensure that variables are bounded between 0 and 1.

The "best performing" method is identified by comparing goodness-of-fit to the straight line of the QQ plot for the quantiles of the data normalized by a given method and the standard normal distribution. If a tie is present between transformations for a variable, one of the best performing transformations is arbitrarily selected.

xform = 1: Best performing method is presented (excluding the Rankit)

xform = 2: Logarithmic transform

xform = 3: Inverse/reciprocal transform

xform = 4: Square-root transform

xform = 5: Arc-sine transform

xform = 6: Logit transform

xform = 7: Rankit transform

alpha

The two-sided decision threshold used for normality hypothesis-testing (scalar)

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

autoPlot

Desired plotting method (boolean). By default (plot = 1), the visualization will be implicitly chosen based on extracted features of the dataset.

When autoPlot = F, values of additional plotting arguments are used to determine the visualizations provided to the user.

When autoPlot = T:

Histograms are always generated for discrete data.

Density plots are always generated for continuous data.

Strip plots are generated when the # distinct values are <= 20 AND the # datapoints are 15 <= x <= 150.

Violin plots are instead generated in lieu of the strip plots created when the above conditions are not met.

Lastly, density plots for each (transformed*) variable are generated.

*Transformed variables correspond to the choice made by the user for the xform argument or to the best-performing transformation for each variable when xform = 1.

All plots are drawn in the R console and saved as plotting objects.

histPlot

Whether to generate histograms for each variable (boolean).

densPlot

Whether to generate density plots for each variable (boolean).

stripPlot

Whether to draw strip plots for each variable (boolean).

violinPlot

Whether to draw violin plots for each variable (boolean).

xformPlot

Whether to draw density plots for each transformed variable (boolean).

return

Whether to return the transformed variables of the best performing method (return = T; default), or the cleaned, untransformed variables eligible for transformation (return = F) (boolean).

seed

Number used for reproduction of random number generator results (scalar).

Details

Any rows with missing values (NAs) are removed for calculation purposes; if desired, incomplete records should be imputed or removed with subsetting prior to calling Rita. In addition, note that any columns not numeric type or coercible to numeric are excluded from analysis, as are any numeric columns with 2 distinct values or less.

Value

An object containing the dataset of the best performing transformation for each variable and the specified plots (list)

Examples

values <- rnorm(100)
x <- Rita(data = values)

Shapiro-Wilk Test

Description

This function is a wrapper for shapiro.test() from the stats package. Options added include an ability to toggle a Bonferonni correction for significance, a corresponding significance flag, and reorganized output to facilitate integration with the Rita package.

Usage

SWTest(data, alpha = 0.05, j = 1, warn = T)

Arguments

data

Data of a univariate distribution for which the test statistic is computed (vector)

alpha

The two-sided decision threshold used for hypothesis-testing

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

warn

Used for printing a warning message when resampling is performed on sample-sizes > 5000 or when testing is terminated for N < 3 (boolean)

Details

Note that when the sample-size of the input vector is > 5000, resampling with replacement is used to proceed with hypothesis-testing with a vector of 5000 elements. When N < 3, testing is terminated.

Value

An object including the test statistic, p-value, and a significance flag (list)

References

Patrick Royston (1982). An extension of Shapiro and Wilk's W test for normality to large samples. Applied Statistics, 31, 115–124. 10.2307/2347973

Patrick Royston (1982). Algorithm AS 181: The W test for Normality. Applied Statistics, 31, 176–180. 10.2307/2347986

Patrick Royston (1995). Remark AS R94: A remark on Algorithm AS 181: The W test for normality. Applied Statistics, 44, 547–551. 10.2307/2986146

Examples

values <- rnorm(100)
x <- SWTest(data = values)

Arcsine Transformation

Description

This function transforms the scale, if needed, to values of unity. Then, the data is transformed by taking the arcsine of each value. Per the recommendations of Osborne(2002), data points are left-anchored at 0 to maximize the efficacy of the square-root transformation used enroute to the arcsine.

Usage

arcsineXform(sample)

Arguments

sample

The input data (vector)

Value

The arcsine-transformed data (vector)

References

Osborne, J. W. (2002). Notes on the use of data transformations. Practical Assessment, Research and Evaluation, 9(1), 42-50.

Osborne, J. W. (2002). The Effects of Minimum Values on Data Transformations. Retrieved from https://files.eric.ed.gov/fulltext/ED463313.pdf

Examples

values <- rnorm(100)
x <- arcsineXform(values)

Chi-Square Test

Description

This function computes the chi-square test for normality.

Usage

chisqTest(data, alpha = 0.05, j = 1, df = 3)

Arguments

data

Data of a univariate distribution for which the test statistic is computed (vector)

alpha

The two-sided decision threshold used for hypothesis-testing

j

The # hypotheses tested; used to compute a Bonferonni correction, if applicable; should remain at its default if multiple testing is not an issue (scalar)

df

The degrees of freedom used to test for significance against the sampling distribution (scalar)

Details

Bins are created by cutting the data to ensure that values within these intervals would be equally probable if data are normal (Moore, 1986). By default, this function assumes that all relevant parameters (mu, sigma) are estimators, fixing the degrees of freedom at df = 3.

Value

An object including the test statistic, p-value, and a significance flag (list)

References

Moore, D.S., (1986) Tests of the chi-squared type. In: D'agostino, R.B. and Stephens, M.A., eds.: Goodness-of-Fit Techniques. Marcel Dekker, New York.

Examples

values <- rnorm(100)
x <- chisqTest(data = values)

Inverse/Reciprocal Transformation

Description

This function imputes minimum values per the recommendations of Osborne (2002) and subsequently transforms the data using the reciprocal.

Usage

inverseXform(sample)

Arguments

sample

The input data (vector)

Value

The reciprocal-transformed data (vector)

References

Osborne, J. W. (2002). Notes on the use of data transformations. Practical Assessment, Research and Evaluation, 9(1), 42-50.

Osborne, J. W. (2002). The Effects of Minimum Values on Data Transformations. Retrieved from https://files.eric.ed.gov/fulltext/ED463313.pdf

Examples

values <- rnorm(100)
x <- inverseXform(values)

Adjusted Fisher-Pearson Excess Sample Kurtosis

Description

Adjusted Fisher-Pearson Excess Sample Kurtosis

Usage

kurtCoeff(data, sd)

Arguments

data

The data for which kurtosis is computed (vector)

sd

The population standard deviation, used to compute kurtosis (scalar)

Value

The kurtosis value (scalar)

References

Shreve, Joni N. and Donna Dea Holland . 2018. SAS® Certification Prep Guide: Statistical Business Analysis Using SAS®9. Cary, NC: SAS Institute Inc.

Examples

values <- rnorm(100)
x <- kurtCoeff(data = values, sd = sd(values))

Logarithmic Transformation

Description

This function imputes minimum values per the recommendations of Osborne (2002) and subsequently transforms the data to a base-10 logarithmic scale.

Usage

logXform(sample)

Arguments

sample

The input data (vector)

Value

The log-transformed data (vector)

References

Osborne, J. W. (2002). Notes on the use of data transformations. Practical Assessment, Research and Evaluation, 9(1), 42-50.

Osborne, J. W. (2002). The Effects of Minimum Values on Data Transformations. Retrieved from https://files.eric.ed.gov/fulltext/ED463313.pdf

Examples

values <- rnorm(100)
x <- logXform(values)

Logit/Log-Odds Transformation

Description

This function transforms data via the logit/log-odds transformation.

Usage

logitXform(sample, divisor = 2)

Arguments

sample

The input data (vector, matrix, or dataframe)

divisor

Number used to modify epsilon enroute to the empirical logit, in cases of output consisting of a single distinct value (scalar)

Details

Initially, features of the input data are extracted and used to determine an initial transformation to perform.

All forms of data representing an underlying discrete scale are converted to proportions of the total sample size, if needed. In these cases, values should be stored such that elements are in absolute frequency, relative frequency, or percentage form.

For non-count data, variables are shifted and bounded at [0,1] in a manner analogous to the potential transformations of the scale performed by arcsineXform() prior to the arcine, although transformed values are not expected to outperform more suitable transformations.

Then, the empirical logit transformation is applied to avoid zeroes or ones, and the data are transformed by taking the log-odds/logit of each value.

Value

The logit-transformed data (vector)

References

Stevens, S., Valderas, J. M., Doran, T., Perera, R., & Kontopantelis, E. (2016). Analysing indicators of performance, satisfaction, or safety using empirical logit transformation. bmj, 352.

Osborne, J. W. (2002). Notes on the use of data transformations. Practical Assessment, Research and Evaluation, 9(1), 42-50.

Osborne, J. W. (2002). The Effects of Minimum Values on Data Transformations. Retrieved from https://files.eric.ed.gov/fulltext/ED463313.pdf

Warton, D. I., & Hui, F. K. (2011). The arcsine is asinine: the analysis of proportions in ecology. Ecology, 92(1), 3-10.

Examples

values <- rnorm(100)
x <- logitXform(values)

Converts Sample Standard Deviations into Population Equivalents

Description

This function converts a sample standard deviation (SD) input into the population equivalent. This code is vectorized to convert several sample standard deviations for univariate distributions of identical sample-sizes, if desired.

Usage

popSD(s, n)

Arguments

s

The sample SD(s) (vector)

n

The sample-size for each SD to be converted (vector)

Value

The population SD(s) (vector)

References

Ruscio, J. (2021). Fundamentals of research design and statistical analysis. Ewing, NJ: The College of New Jersey, Psychology Department.

Examples

values <- rnorm(100)
x <- popSD(s = sd(values),n = 100)

Rankit Transformation

Description

This function transforms data via the Rankit, a member of the families of 'rank-based normalization methods' and 'empirical normal quantile transformations' employed in both the social sciences and quantitative genetics.

Usage

rankitXform(sample)

Arguments

sample

The input data (vector)

Value

The Rankit-transformed data (vector)

References

Soloman, S. R., & Sawilowsky, S. S. (2009). Impact of rank-based normalizing transformations on the accuracy of test scores. Journal of Modern Applied Statistical Methods, 8(2), 9.

Peng, B., Robert, K. Y., DeHoff, K. L., & Amos, C. I. (2007, December). Normalizing a large number of quantitative traits using empirical normal quantile transformation. In BMC proceedings (Vol. 1, No. 1, p. S156). BioMed Central. doi: 10.1186/1753-6561-1-s1-s156

Bliss, C. I., Greenwood, M. L., & White, E. S. (1956). A rankit analysis of paired comparisons for measuring the effect of sprays on flavor. Biometrics, 12(4), 381-403.

Examples

values <- rnorm(100)
x <- rankitXform(values)

Adjusted Fisher-Pearson Skewness Coefficient with Sample-size Correction Factor

Description

Adjusted Fisher-Pearson Skewness Coefficient with Sample-size Correction Factor

Usage

skewCoeff(data, sd)

Arguments

data

The data for which skewness is computed (vector)

sd

The population standard deviation, used to compute skewness (scalar)

Value

The skewness value (scalar)

References

Shreve, Joni N. and Donna Dea Holland . 2018. SAS® Certification Prep Guide: Statistical Business Analysis Using SAS®9. Cary, NC: SAS Institute Inc.

Examples

values <- rnorm(100)
x <- skewCoeff(data = values,sd = sd(values))

Square-root Transformation

Description

This function left anchors the minimum value to 0 per the recommendations of Osborne (2002) and subsequently transforms the data by taking the square-root of each value.

Usage

squareXform(sample)

Arguments

sample

The input data (vector)

Value

The square-transformed data (vector)

References

Osborne, J. W. (2002). Notes on the use of data transformations. Practical Assessment, Research and Evaluation, 9(1), 42-50.

Osborne, J. W. (2002). The Effects of Minimum Values on Data Transformations. Retrieved from https://files.eric.ed.gov/fulltext/ED463313.pdf

Examples

values <- rnorm(100)
x <- squareXform(values)