Title: | Generate Suggestions for Validation Rules |
Version: | 0.3.2 |
Description: | Generate suggestions for validation rules from a reference data set, which can be used as a starting point for domain specific rules to be checked with package 'validate'. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
Imports: | validate, whisker, rpart |
URL: | https://github.com/data-cleaning/validatesuggest |
BugReports: | https://github.com/data-cleaning/validatesuggest/issues |
Depends: | R (≥ 2.10) |
Suggests: | knitr, rmarkdown, tinytest |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2023-10-06 10:24:39 UTC; edwin |
Author: | Edwin de Jonge |
Maintainer: | Edwin de Jonge <edwindjonge@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-10-06 16:40:02 UTC |
validatesuggest: Generate Suggestions for Validation Rules
Description
Generate suggestions for validation rules from a reference data set, which can be used as a starting point for domain specific rules to be checked with package 'validate'.
validatesuggest
The goal of validatesuggest is to generate suggestions for validation rules from a supplied dataset. These can be used as a starting point for a rule set and are to be adjusted by domain experts.
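As a minimal sketch of the intended workflow, assuming both validatesuggest and validate are installed, the suggested rules can be confronted with the reference data itself. Here `confront()` and `summary()` come from package 'validate'; the suggested rules remain a starting point and still need review by a domain expert.

```r
# Assumes the packages 'validate' and 'validatesuggest' are installed.
library(validate)
library(validatesuggest)

data("car_owner")

# Suggest conditional rules from the reference data set.
rules <- suggest_cond_rule(car_owner)

# Check the data against the suggested rules and summarise the outcome.
checked <- confront(car_owner, rules)
summary(checked)
```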
Author(s)
Maintainer: Edwin de Jonge edwindjonge@gmail.com (ORCID)
Authors:
Olav ten Bosch
See Also
Useful links:
Report bugs at https://github.com/data-cleaning/validatesuggest/issues
Car owners data set (fictitious).
Description
A constructed data set useful for detecting conditional dependencies.
Usage
car_owner
Format
A data frame with 200 rows and 5 variables. Each row is a person with:
- age
age of person
- driver_license
has a driver license; only persons older than 17 can have a license in this data set
- income
monthly income
- owns_car
only persons with a driver license and a monthly income > 1500 can own a car
- car_color
NA when there is no car
Examples
data("car_owner")
rules <- suggest_cond_rule(car_owner)
rules$rules
Suggest rules
Description
Suggests rules using the various suggestion checks. Use the more specific suggest functions for more control.
Usage
suggest_rules(
  d,
  vars = names(d),
  domain_check = TRUE,
  range_check = TRUE,
  pos_check = TRUE,
  type_check = TRUE,
  na_check = TRUE,
  unique_check = TRUE,
  ratio_check = TRUE,
  conditional_rule = TRUE
)
suggest_all(
  d,
  vars = names(d),
  domain_check = TRUE,
  range_check = TRUE,
  pos_check = TRUE,
  type_check = TRUE,
  na_check = TRUE,
  unique_check = TRUE,
  ratio_check = TRUE,
  conditional_rule = TRUE
)
write_all_suggestions(
  d,
  vars = names(d),
  file = stdout(),
  domain_check = TRUE,
  range_check = TRUE,
  type_check = TRUE,
  pos_check = TRUE,
  na_check = TRUE,
  unique_check = TRUE,
  ratio_check = TRUE,
  conditional_rule = TRUE
)
Arguments
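d | data.frame from which the suggestions are derived |
vars | names of the variables in d to use; defaults to all columns |
domain_check | if TRUE, include the domain check |
range_check | if TRUE, include the range check |
pos_check | if TRUE, include the positivity check |
type_check | if TRUE, include the type check |
na_check | if TRUE, include the completeness check |
unique_check | if TRUE, include the uniqueness check |
ratio_check | if TRUE, include the ratio check |
conditional_rule | if TRUE, include the conditional rule suggestions |
file | file to which the checks will be written |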
Value
Returns a validate::validator() object with the suggested rules.
write_all_suggestions writes the rules to file and invisibly returns a named list of ranges for each variable.
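A short sketch of switching individual checks off, using only the arguments listed in the Usage above and assuming validatesuggest is installed. Which checks are dropped here is an arbitrary choice for illustration.

```r
# Assumes the package 'validatesuggest' is installed.
library(validatesuggest)

data("car_owner")

# Suggest rules, leaving out the ratio and uniqueness checks.
rules <- suggest_rules(car_owner, ratio_check = FALSE, unique_check = FALSE)
rules

# Write the same suggestions as text; 'file' defaults to stdout().
write_all_suggestions(car_owner, ratio_check = FALSE, unique_check = FALSE)
```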
task2 dataset
Description
Fictitious test data set from the European (ESSnet) project on validation, 2017.
Usage
task2
Format
- ID
ID
- Age
Age of person
- Married
Marital status
- Employed
Employed or not
- Working_hours
Working hours
References
European (ESSnet) project on validation 2017
Suggest a conditional rule
Description
Suggest a conditional rule based on an association rule. This function derives conditional rules based on the non-existence of combinations of categories in pairs of variables. For each numerical variable a logical variable is derived that tests for positivity. It generates IF THEN rules based on two variables.
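The idea of deriving IF THEN rules from combinations that never occur can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the toy data frame and the rule text format are hypothetical.

```r
# Toy data (hypothetical), loosely modelled on the car_owner example.
d <- data.frame(
  driver_license = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  owns_car       = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Cross-tabulate the pair of variables; cells with count 0 are kept.
tab <- table(driver_license = d$driver_license, owns_car = d$owns_car)

# List every combination with its frequency and keep the absent ones.
combos <- as.data.frame(tab, stringsAsFactors = FALSE)
absent <- combos[combos$Freq == 0, ]

# An absent combination (a, b) reads as: IF x == a THEN y != b.
rules <- sprintf("if (driver_license == %s) owns_car != %s",
                 absent$driver_license, absent$owns_car)
print(rules)
```

In this toy data nobody without a license owns a car, so that single combination is the one turned into a rule.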
Usage
write_cond_rule(d, vars = names(d), file = stdout())
suggest_cond_rule(d, vars = names(d))
Arguments
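d | data.frame from which the rules are derived |
vars | names of the variables in d to use; defaults to all columns |
file | file to which the checks will be written |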
Value
suggest_cond_rule returns a validate::validator() object with the suggested rules.
write_cond_rule invisibly returns a named list of ranges for each variable.
Examples
data(retailers, package="validate")
# will generate check for all columns in retailers that are
# complete.
suggest_na_check(retailers)
data("car_owner")
rules <- suggest_cond_rule(car_owner)
rules$rules
Suggest a domain check
Description
Suggest a domain check
Usage
write_domain_check(d, vars = names(d), only_positive = TRUE, file = stdout())
suggest_domain_check(d, vars = names(d), only_positive = TRUE)
Arguments
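d | data.frame from which the checks are derived |
vars | names of the variables in d to use; defaults to all columns |
only_positive | if |
file | file to which the checks will be written |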
Value
suggest_domain_check returns a validate::validator() object with the suggested rules.
write_domain_check invisibly returns a named list of checks for each variable.
Examples
data(SBS2000, package="validate")
suggest_range_check(SBS2000)
# checks the ranges of each variable
suggest_range_check(SBS2000[-1], min=TRUE, max=TRUE)
# checks the ranges of each variable
suggest_range_check(SBS2000, vars=c("turnover", "other.rev"), min=FALSE, max=TRUE)
Suggest a check for completeness.
Description
Suggest a check for completeness.
Usage
write_na_check(d, vars = names(d), file = stdout())
suggest_na_check(d, vars = names(d))
Arguments
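d | data.frame from which the checks are derived |
vars | names of the variables in d to use; defaults to all columns |
file | file to which the checks will be written |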
Value
suggest_na_check returns a validate::validator() object with the suggested rules.
write_na_check writes the rules to file and invisibly returns a named list of ranges for each variable.
Examples
data(retailers, package="validate")
# will generate check for all columns in retailers that are
# complete.
suggest_na_check(retailers)
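The comment in the example says that a check is generated for every column that is complete in the reference data. A base R sketch of that selection step follows; the toy data frame and the `!is.na()` rule text are assumptions for illustration, not the package's output format.

```r
# Toy data (hypothetical): 'staff' has a missing value, the others do not.
d <- data.frame(
  id       = 1:4,
  staff    = c(10, NA, 3, 7),
  turnover = c(100, 250, 80, 60)
)

# Columns without any missing value in the reference data.
complete_vars <- names(d)[colSums(is.na(d)) == 0]

# One completeness rule per complete column.
rules <- sprintf("!is.na(%s)", complete_vars)
print(rules)
```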
Suggest a positivity check
Description
Suggest a positivity check
Usage
write_pos_check(d, vars = names(d), only_positive = TRUE, file = stdout())
suggest_pos_check(d, vars = names(d), only_positive = TRUE)
Arguments
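d | data.frame from which the checks are derived |
vars | names of the variables in d to use; defaults to all columns |
only_positive | if |
file | file to which the checks will be written |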
Value
suggest_pos_check returns a validate::validator() object with the suggested rules.
write_pos_check writes the rules to file and invisibly returns a named list of checks for each variable.
Examples
data(SBS2000, package="validate")
suggest_range_check(SBS2000)
# checks the ranges of each variable
suggest_range_check(SBS2000[-1], min=TRUE, max=TRUE)
# checks the ranges of each variable
suggest_range_check(SBS2000, vars=c("turnover", "other.rev"), min=FALSE, max=TRUE)
Suggest a range check
Description
Suggest a range check
Usage
write_range_check(d, vars = names(d), min = TRUE, max = FALSE, file = stdout())
suggest_range_check(d, vars = names(d), min = TRUE, max = FALSE)
Arguments
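d | data.frame from which the checks are derived |
vars | names of the variables in d to use; defaults to all columns |
min | if TRUE, suggest a check on the minimum value |
max | if TRUE, suggest a check on the maximum value |
file | file to which the checks will be written |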
Value
suggest_range_check returns a validate::validator() object with the suggested rules.
write_range_check writes the rules to file and invisibly returns a named list of ranges for each variable.
Examples
data(SBS2000, package="validate")
suggest_range_check(SBS2000)
# checks the ranges of each variable
suggest_range_check(SBS2000[-1], min=TRUE, max=TRUE)
# checks the ranges of each variable
suggest_range_check(SBS2000, vars=c("turnover", "other.rev"), min=FALSE, max=TRUE)
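How a range check can be read off a reference data set is sketched below in base R. This is not the package's code: the helper `range_rules()` and its arguments `lower` and `upper` are hypothetical names that play the role of `min` and `max` above, and the rule text format is an assumption.

```r
# Toy data (hypothetical).
d <- data.frame(age = c(18, 35, 62), income = c(0, 1500, 4200))

# Hypothetical helper: derive bound rules from the observed minimum and maximum.
range_rules <- function(d, vars = names(d), lower = TRUE, upper = FALSE) {
  rules <- character(0)
  for (v in vars) {
    x <- d[[v]]
    if (!is.numeric(x)) next  # range checks only make sense for numbers
    if (lower) rules <- c(rules, sprintf("%s >= %s", v, format(min(x, na.rm = TRUE))))
    if (upper) rules <- c(rules, sprintf("%s <= %s", v, format(max(x, na.rm = TRUE))))
  }
  rules
}

rules <- range_rules(d, lower = TRUE, upper = TRUE)
print(rules)
```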
Suggest ratio checks
Description
Suggest ratio checks
Usage
write_ratio_check(
  d,
  vars = names(d),
  file = stdout(),
  lin_cor = 0.95,
  digits = 2
)
suggest_ratio_check(d, vars = names(d), lin_cor = 0.95, digits = 2)
Arguments
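d | data.frame from which the checks are derived |
vars | names of the variables in d to use; defaults to all columns |
file | file to which the checks will be written |
lin_cor | threshold for the absolute correlation; pairs of variables above it are included |
digits | number of digits for rounding |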
Value
suggest_ratio_check returns a validate::validator() object with the suggested rules.
write_ratio_check writes the rules to file and invisibly returns a named list of checks for each variable.
Examples
data(SBS2000, package="validate")
# generates upper and lower checks for the
# ratio of two variables if their correlation is
# greater than `lin_cor`
suggest_ratio_check(SBS2000, lin_cor=0.98)
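The example above describes upper and lower checks on the ratio of two strongly correlated variables. A base R sketch of that idea follows. How the package computes and rounds its bounds is not stated here, so rounding the lower bound down and the upper bound up is an assumption, as are the toy data and the rule text format.

```r
# Toy data (hypothetical): costs are roughly half of turnover.
d <- data.frame(
  turnover = c(100, 200, 300, 400),
  costs    = c(52, 98, 153, 199)
)
lin_cor <- 0.95
digits  <- 2

rules <- character(0)
# Only suggest ratio bounds when the two variables are strongly correlated.
if (abs(cor(d$turnover, d$costs)) >= lin_cor) {
  ratio <- d$turnover / d$costs
  scale <- 10^digits
  # Round outwards so that the reference data itself satisfies the bounds.
  lower <- floor(min(ratio) * scale) / scale
  upper <- ceiling(max(ratio) * scale) / scale
  rules <- c(
    sprintf("turnover >= %s * costs", lower),
    sprintf("turnover <= %s * costs", upper)
  )
}
print(rules)
```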
Suggest a type check
Description
Suggest a type check
Usage
write_type_check(d, vars = names(d), file = stdout())
suggest_type_check(d, vars = names(d))
Arguments
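d | data.frame from which the checks are derived |
vars | names of the variables in d to use; defaults to all columns |
file | file to which the checks will be written |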
Value
suggest_type_check returns a validate::validator() object with the suggested rules.
write_type_check writes the rules to file and invisibly returns a named list of types for each variable.
Suggest uniqueness checks
Description
Suggest uniqueness checks
Usage
write_unique_check(d, vars = names(d), file = stdout(), fraction = 0.95)
suggest_unique_check(d, vars = names(d), fraction = 0.95)
Arguments
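d | data.frame from which the checks are derived |
vars | names of the variables in d to use; defaults to all columns |
file | file to which the checks will be written |
fraction | if values in a column > |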
Value
suggest_unique_check returns a validate::validator() object with the suggested rules.
write_unique_check writes the rules to file and invisibly returns a named list of checks for each variable.