Title: | Variant Quality Investigation Helper |
Version: | 1.0.0 |
Description: | Imports Variant Call Format (VCF) files into R and detects whether a sample is contaminated by another sample of the same species. In the first stage of the approach, a change-point detection method is used to identify copy number variations for filtering. Next, features are extracted from the data for a support vector machine model. For the log-likelihood calculation, the deviation parameter is estimated by the maximum likelihood method. Using a support vector machine with a radial basis function kernel, the contamination of a sample can be detected. |
Depends: | R (≥ 3.4.0) |
Imports: | changepoint, e1071, ggplot2, stats, VGAM |
License: | GPL-2 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 6.0.1 |
NeedsCompilation: | no |
Packaged: | 2018-07-17 19:05:46 UTC; tjiang8 |
Author: | Tao Jiang [aut, cre] |
Maintainer: | Tao Jiang <tjiang8@ncsu.edu> |
Repository: | CRAN |
Date/Publication: | 2018-09-05 14:50:04 UTC |
Default parameters of config.
Description
A dataframe containing default parameters.
Usage
config_df
Format
A data frame with 12 variables:
threshold
Threshold for allele frequency
skew
Skewness for allele frequency
lower
Lower bound for allele frequency region
upper
Upper bound for allele frequency region
ldpthred
Threshold to determine low depth
hom_mle
Hom MLE of p in Beta-Binomial model
het_mle
Het MLE of p in Beta-Binomial model
Hom_thred
Threshold between hom and high
High_thred
Threshold between high and het
Het_thred
Threshold between het and low
hom_rho
Hom MLE of rho in Beta-Binomial model
het_rho
Het MLE of rho in Beta-Binomial model
Source
Created by Tao Jiang
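The twelve variables can be pictured as a one-row data frame. The sketch below is hypothetical: my_config is not part of the package, and its values are assembled from defaults documented for other functions in this manual (update_vcf, getLowDepth, getAvgLL, generate_feature). Whether the shipped config_df holds exactly these values, or has exactly this layout, is an assumption.

```r
# Hypothetical configuration built from defaults documented elsewhere in
# this manual; the shipped config_df may differ in values and layout.
my_config <- data.frame(
  threshold  = 0.1,         # update_vcf default
  skew       = 0.5,         # update_vcf default
  lower      = 0.45,        # update_vcf default
  upper      = 0.55,        # update_vcf default
  ldpthred   = 20,          # getLowDepth default
  hom_mle    = 0.9981416,   # getAvgLL default
  het_mle    = 0.4737897,   # getAvgLL default
  Hom_thred  = 0.99,        # generate_feature homcut default
  High_thred = 0.7,         # generate_feature highcut default
  Het_thred  = 0.3,         # generate_feature hetcut default
  hom_rho    = 0.04570275,  # getAvgLL default
  het_rho    = 0.02224098   # getAvgLL default
)
str(my_config)
```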
DEtection of Frequency CONtamination
Description
Detects whether a sample is contaminated by another sample of the same species. The input file should be in VCF format.
Usage
defcon(file, rmCNV = FALSE, cnvobj = NULL, config = NULL,
class_model = NULL, regression_model = NULL)
Arguments
file |
VCF input object |
rmCNV |
Remove CNV regions, default is FALSE |
cnvobj |
CNV object, default is NULL |
config |
config information of parameters. A default set is generated as part of the model and is included in a model object, which contains |
class_model |
An SVM classification model |
regression_model |
An SVM regression model |
Value
A list containing (1) stat: a data frame with all statistics for contamination estimation; (2) result: contamination estimation (Class = 0, pure; Class = 1, contaminated)
Examples
data(vcf_example)
result <- defcon(file = vcf_example)
Feature Generation for Contamination Detection Model
Description
Generates features from each pair of input VCF objects for training the contamination detection model.
Usage
generate_feature(file, hom_p = 0.999, het_p = 0.5, hom_rho = 0.005,
het_rho = 0.1, mixture, homcut = 0.99, highcut = 0.7, hetcut = 0.3)
Arguments
file |
VCF input object |
hom_p |
The initial value for p in Homozygous Beta-Binomial model, default is 0.999 |
het_p |
The initial value for p in Heterozygous Beta-Binomial model, default is 0.5 |
hom_rho |
The initial value for rho in Homozygous Beta-Binomial model, default is 0.005 |
het_rho |
The initial value for rho in Heterozygous Beta-Binomial model, default is 0.1 |
mixture |
A vector of whether the sample is contaminated: 0 for pure; 1 for contaminated |
homcut |
Cutoff allele frequency value between hom and high, default is 0.99 |
highcut |
Cutoff allele frequency value between high and het, default is 0.7 |
hetcut |
Cutoff allele frequency value between het and low, default is 0.3 |
Value
A data frame with all features for training the contamination detection model
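The three cutoffs split the allele frequency range into four zones (low, het, high, hom). A minimal sketch of that partition in base R follows. The allele frequencies are made up, and the handling of values that fall exactly on a cutoff is an assumption, since the manual does not state it.

```r
# Hypothetical allele frequencies (alternative depth / total depth).
af <- c(0.995, 0.85, 0.52, 0.48, 0.10, 0.72)

# Default cutoffs documented for generate_feature().
homcut  <- 0.99
highcut <- 0.7
hetcut  <- 0.3

# Left-closed intervals: a value equal to a cutoff goes to the upper zone
# (an assumption, not taken from the package source).
zone <- cut(af,
            breaks = c(-Inf, hetcut, highcut, homcut, Inf),
            labels = c("low", "het", "high", "hom"),
            right  = FALSE)
table(zone)
```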
Second alternative allele percentage
Description
Second alternative allele percentage
Usage
getAlt2(f)
Arguments
f |
Input raw file |
Value
Percent of the second alternative allele
Annotation rate
Description
Annotation rate
Usage
getAnnoRate(f)
Arguments
f |
Input raw file |
Value
Percentage of annotated loci
Calculate average log-likelihood
Description
Calculate average log-likelihood
Usage
getAvgLL(df, hom_mle, het_mle, hom_rho, het_rho)
Arguments
df |
Input modified file |
hom_mle |
Hom MLE of p in Beta-Binomial model, default is 0.9981416 from NA12878_1_L5 |
het_mle |
Het MLE of p in Beta-Binomial model, default is 0.4737897 from NA12878_1_L5 |
hom_rho |
Hom MLE of rho in Beta-Binomial model, default is 0.04570275 from NA12878_1_L5 |
het_rho |
Het MLE of rho in Beta-Binomial model, default is 0.02224098 from NA12878_1_L5 |
Value
meanLL
Low depth percentage
Description
Low depth percentage
Usage
getLowDepth(f, ldpthred)
Arguments
f |
Input raw file |
ldpthred |
Threshold to determine low depth, default is 20 |
Value
Percentage of low depth
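A minimal sketch of the calculation, assuming "low depth" means a total depth strictly below ldpthred and that the result is reported on a 0 to 100 scale. Both points are assumptions, and the depth values are made up.

```r
# Hypothetical total read depths at eight variant sites.
depth    <- c(35, 12, 48, 19, 20, 7, 60, 25)
ldpthred <- 20  # documented default threshold

# Share of sites with depth strictly below the threshold, in percent.
low_depth_pct <- 100 * mean(depth < ldpthred)
low_depth_pct
```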
Get the ratio of allele frequencies within a region
Description
Get the ratio of allele frequencies within a region
Usage
getRatio(subdf, lower, upper)
Arguments
subdf |
Dataframe with calculated statistics |
lower |
Lower bound for allele frequency region |
upper |
Upper bound for allele frequency region |
Value
Ratio of allele frequencies within a region
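One plausible reading of this statistic is the fraction of sites whose allele frequency lies between lower and upper. The sketch below implements that reading only; the exact definition used by getRatio is not given in this manual, and the data are made up.

```r
# Hypothetical allele frequencies at heterozygous sites.
af    <- c(0.30, 0.46, 0.50, 0.54, 0.62, 0.48, 0.40, 0.52)
lower <- 0.45  # update_vcf default
upper <- 0.55  # update_vcf default

# Fraction of sites with allele frequency inside [lower, upper];
# inclusive bounds are an assumption.
in_region    <- af >= lower & af <= upper
region_ratio <- mean(in_region)
region_ratio
```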
SNV percentage
Description
SNV percentage
Usage
getSNVRate(df)
Arguments
df |
Input raw file |
Value
Percentage of SNV
Get absolute value of skewness
Description
Get absolute value of skewness
Usage
getSkewness(subdf)
Arguments
subdf |
Input dataframe |
Value
Absolute value of skewness
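For reference, absolute skewness can be computed in base R from the second and third central moments. This is a sketch of the moment estimator only. The function name abs_skewness is hypothetical, and the manual does not say which estimator getSkewness uses (e1071::skewness, for example, offers several types).

```r
# Absolute moment-based skewness: |m3 / m2^(3/2)|, where m2 and m3 are
# the second and third central sample moments.
abs_skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  abs(m3 / m2^1.5)
}

abs_skewness(c(1, 2, 3, 4, 5))   # symmetric data
abs_skewness(c(1, 1, 1, 10))     # right-skewed data
```

Taking the absolute value makes left-skewed and right-skewed allele frequency distributions score the same.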
Calculate zygosity variable
Description
Calculate zygosity variable
Usage
getVar(df, state, hom_mle, het_mle)
Arguments
df |
Input modified file |
state |
Zygosity state |
hom_mle |
MLE in hom model |
het_mle |
MLE in het model |
Value
Zygosity variable
Check input filename
Description
Check input filename
Usage
locateFile(fn, extension)
Arguments
fn |
Exact full file name of input file, including directory |
extension |
Expected input file extension: vcf & txt |
Value
Valid directory
Negative Log Likelihood
Description
Calculates negative log likelihood for beta binomial distribution.
Usage
negll(x, size, prob, rho)
Arguments
x |
Depth of alternative allele |
size |
Total depth |
prob |
Theoretical probability for heterozygous is 0.5, for homozygous is 0.999 |
rho |
Rho parameter of Beta-Binomial distribution of alternative allele |
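A base-R sketch of a beta-binomial negative log-likelihood in the (prob, rho) parameterisation, where the shape parameters are prob * (1 - rho) / rho and (1 - prob) * (1 - rho) / rho. The helper names bb_logpmf and bb_negll are hypothetical, and the sketch is not claimed to match the internals of negll.

```r
# Log probability mass of the beta-binomial distribution.
bb_logpmf <- function(x, size, prob, rho) {
  a <- prob * (1 - rho) / rho          # first shape parameter
  b <- (1 - prob) * (1 - rho) / rho    # second shape parameter
  lchoose(size, x) + lbeta(x + a, size - x + b) - lbeta(a, b)
}

# Negative log-likelihood summed over all sites.
bb_negll <- function(x, size, prob, rho) {
  -sum(bb_logpmf(x, size, prob, rho))
}

# Hypothetical alternative-allele depths at three sites with total depth 50.
alt   <- c(24, 26, 22)
total <- c(50, 50, 50)
bb_negll(alt, total, prob = 0.5,   rho = 0.1)    # heterozygous model
bb_negll(alt, total, prob = 0.999, rho = 0.005)  # homozygous model
```

Counts near half the depth give a much smaller value under the heterozygous model than under the homozygous one.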
Read in input vcf data in GATK format for Contamination detection
Description
Read in input vcf data in GATK format for Contamination detection
Usage
readGATK(dr, dbOnly, depCut, thred, content, extnum, keepall)
Arguments
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number or numbers to be extracted from vcf, default is 10; 0 for not extracting any columns |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Value
Dataframe from VCF file
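To illustrate what a reader of this kind has to extract, the sketch below parses allele depths from GATK-style FORMAT and sample columns. It assumes a biallelic site and an AD field of the form "ref,alt". The function parse_ad and the example strings are hypothetical, and this is not the code used by readGATK.

```r
# Hypothetical FORMAT and sample columns from three GATK records.
format_col <- c("GT:AD:DP", "GT:AD:DP", "GT:AD:DP")
sample_col <- c("0/1:14,16:30", "1/1:0,42:42", "0/1:9,3:12")

parse_ad <- function(format, sample) {
  keys   <- strsplit(format, ":", fixed = TRUE)
  values <- strsplit(sample, ":", fixed = TRUE)
  # Pick the value at the position of the "AD" key in each record.
  ad <- mapply(function(k, v) v[match("AD", k)], keys, values)
  counts <- strsplit(ad, ",", fixed = TRUE)
  ref <- as.numeric(vapply(counts, `[`, character(1), 1))
  alt <- as.numeric(vapply(counts, `[`, character(1), 2))
  # Depth is taken here as ref + alt rather than the DP field.
  data.frame(ref = ref, alt = alt, depth = ref + alt, af = alt / (ref + alt))
}

calls <- parse_ad(format_col, sample_col)
calls
```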
Read in input vcf data in strelka2 format for Contamination detection
Description
Read in input vcf data in strelka2 format for Contamination detection
Usage
readStrelka(dr, dbOnly, depCut, thred, content, extnum, keepall)
Arguments
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number or numbers to be extracted from vcf, default is 10; 0 for not extracting any columns |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Value
Dataframe from VCF file
Read in input vcf data in VarDict format for Contamination detection
Description
Read in input vcf data in VarDict format for Contamination detection
Usage
readVarDict(dr, dbOnly, depCut, thred, content, extnum, keepall)
Arguments
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number to be extracted from vcf, default is 10; 0 for not extracting any column |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Value
Dataframe from VCF file
Read in input vcf data in VarPROWL format
Description
Read in input vcf data in VarPROWL format
Usage
readVarPROWL(dr, dbOnly, depCut, thred, content, extnum, keepall)
Arguments
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number or numbers to be extracted from vcf, default is 10; 0 for not extracting any columns |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Value
Dataframe from VCF file
VCF Data Input
Description
Reads a vcf or vcf.gz file and creates a list containing Content, Meta, VCF and file_sample_name
Usage
read_vcf(fn, vcffor, dbOnly = FALSE, depCut = FALSE, thred = 20,
metaline = 200, extnum = 10, keepall = TRUE, filter = FALSE)
Arguments
fn |
Input vcf file name |
vcffor |
Input vcf data format: 1) GATK; 2) VarPROWL; 3) VarDict; 4) strelka2 |
dbOnly |
Use dbSNP as filter, default is FALSE |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
metaline |
Number of head lines to read in (better to be large enough), the lines will be checked if they contain meta information, default is 200 |
extnum |
The column number to be extracted from vcf, default is 10; 0 for not extracting any column; extnum should be between 10 and total column number |
keepall |
Keep unextracted column in output, default is TRUE |
filter |
Whether to select "PASS" variants for analyses if they contain unfiltered variants, default is FALSE |
Value
A list containing (1) Content: a vector showing what is contained; (2) Meta: a data frame containing meta-information of the file; (3) VCF: a data frame, the main part of the VCF file; (4) file_sample_name: the file name and sample name, which may differ when one file contains multiple samples
Examples
file.name <- system.file("extdata", "example.vcf.gz", package = "vanquish")
example <- read_vcf(fn=file.name, vcffor="VarPROWL")
Estimate Rho for Alternative Allele Frequency
Description
Estimates Rho parameter in beta binomial distribution for alternative allele frequency
Usage
rho_est(vl)
Arguments
vl |
A list of vcf objects from read_vcf function. |
Value
A list containing (1) het_rho: Rho parameter of heterozygous locations; (2) hom_rho: Rho parameter of homozygous locations
Examples
data("vcf_example")
vcf_list <- list()
vcf_list[[1]] <- vcf_example$VCF
res <- rho_est(vl = vcf_list)
res$het_rho[[1]]$par
res$hom_rho[[1]]$par
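The idea behind the estimate can be sketched on simulated data: draw beta-binomial counts with a known rho, then recover rho by minimising the negative log-likelihood. This is an illustration under stated assumptions, not the package's procedure. It uses stats::optimize on a single parameter, whereas the $par element in the example above suggests the package relies on a different optimiser. All names and values below are hypothetical.

```r
set.seed(1)

# Simulate heterozygous sites: prob = 0.5, rho = 0.1, depth 50.
n_sites   <- 2000
size      <- rep(50, n_sites)
true_prob <- 0.5
true_rho  <- 0.1
shape1 <- true_prob * (1 - true_rho) / true_rho
shape2 <- (1 - true_prob) * (1 - true_rho) / true_rho
alt <- rbinom(n_sites, size, rbeta(n_sites, shape1, shape2))

# Beta-binomial negative log-likelihood as a function of rho only.
negll_rho <- function(rho, x, size, prob) {
  a <- prob * (1 - rho) / rho
  b <- (1 - prob) * (1 - rho) / rho
  -sum(lchoose(size, x) + lbeta(x + a, size - x + b) - lbeta(a, b))
}

# One-dimensional maximum likelihood estimate of rho.
fit <- optimize(negll_rho, interval = c(1e-4, 0.5),
                x = alt, size = size, prob = true_prob)
fit$minimum
```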
Remove CNV regions within VCF files given cnv file
Description
Remove CNV regions within VCF files given cnv file
Usage
rmCNVinVCF(vcf, cnvobj)
Arguments
vcf |
Input VCF files |
cnvobj |
cnv object |
Value
VCF object without change point region
Remove CNV regions within VCF files by change point method
Description
Remove CNV regions within VCF files by change point method
Usage
rmChangePoint(vcf, threshold, skew, lower, upper)
Arguments
vcf |
Input VCF files |
threshold |
Threshold for allele frequency |
skew |
Skewness for allele frequency |
lower |
Lower bound for allele frequency region |
upper |
Upper bound for allele frequency region |
Value
VCF object without change point region
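A shift in the typical allele frequency along a chromosome is the signal a change-point method looks for. The sketch below locates a single mean shift by least squares in base R. It is a simplified stand-in: the package imports changepoint, but the method and penalty used by rmChangePoint are not stated in this manual. The function single_changepoint and the simulated data are hypothetical.

```r
set.seed(42)

# Hypothetical heterozygous allele frequencies ordered by position:
# a normal region (mean 0.50) followed by a CNV-like region (mean 0.33).
af <- c(rnorm(50, mean = 0.50, sd = 0.02),
        rnorm(50, mean = 0.33, sd = 0.02))

# Try every split point and keep the one that minimises the total
# within-segment sum of squares.
single_changepoint <- function(y) {
  n <- length(y)
  cost <- vapply(seq_len(n - 1), function(k) {
    left  <- y[1:k]
    right <- y[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  which.min(cost)  # index of the last point before the shift
}

cp <- single_changepoint(af)
cp
```

Sites after the detected index would then be candidates for removal before feature extraction.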
VCF Data Summary
Description
Summarizes allele frequency information in scatter and density plots
Usage
summary_vcf(vcf, ZG = NULL, CHR = NULL)
Arguments
vcf |
VCF object from read_vcf function |
ZG |
zygosity: (1) null, for both het and hom, default; (2) het; (3) hom |
CHR |
chromosome number: (1) null, all chromosomes, default; (2) any specific number |
Value
A list containing (1) scatter: allele frequency scatter plot; (2) density: allele frequency density plot
Examples
data("vcf_example")
tmp <- summary_vcf(vcf = vcf_example, ZG = 'het', CHR = c(1,2))
plot(tmp$scatter)
plot(tmp$density)
Default svm classification model.
Description
An svm object containing default svm classification model.
Usage
svm_class_model
Format
An svm object:
Source
Created by Tao Jiang
Default svm regression model.
Description
An svm object containing default svm regression model.
Usage
svm_regression_model
Format
An svm object:
Source
Created by Tao Jiang
Train Contamination Detection Model
Description
Trains two SVM models (classification and regression) to detect whether a sample is contaminated by another sample of the same species.
Usage
train_ct(feature)
Arguments
feature |
Feature list objects from generate_feature() |
Value
A list containing two trained SVM models: regression and classification
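A minimal sketch of fitting a classification and a regression SVM with a radial basis function kernel using e1071, which the package imports. The feature names (LowDepth, AvgLL), the contamination fraction Mix, and the simulated values are hypothetical. The real feature set comes from generate_feature(), and the settings used by train_ct are not documented here.

```r
library(e1071)
set.seed(7)

# Hypothetical training features: 60 pure and 60 contaminated samples.
n <- 60
features <- data.frame(
  LowDepth = c(rnorm(n, 5, 1),    rnorm(n, 9, 1)),
  AvgLL    = c(rnorm(n, -3, 0.5), rnorm(n, -6, 0.5)),
  Class    = factor(rep(c(0, 1), each = n)),     # 0 pure, 1 contaminated
  Mix      = c(rep(0, n), runif(n, 0.05, 0.5))   # contamination fraction
)

# Classification: is the sample contaminated?
class_fit <- svm(Class ~ LowDepth + AvgLL, data = features,
                 kernel = "radial", type = "C-classification")

# Regression: how large is the contamination fraction?
reg_fit <- svm(Mix ~ LowDepth + AvgLL, data = features,
               kernel = "radial", type = "eps-regression")

pred_class <- predict(class_fit, features)
pred_mix   <- predict(reg_fit, features)
mean(pred_class == features$Class)  # training accuracy
```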
Remove CNV regions within VCF files
Description
Remove CNV regions within VCF files
Usage
update_vcf(rmCNV = FALSE, vcf, cnvobj = NULL, threshold = 0.1,
skew = 0.5, lower = 0.45, upper = 0.55)
Arguments
rmCNV |
Remove CNV regions, default is FALSE |
vcf |
Input VCF files |
cnvobj |
cnv object, default is NULL |
threshold |
Threshold for allele frequency, default is 0.1 |
skew |
Skewness for allele frequency, default is 0.5 |
lower |
Lower bound for allele frequency region, default is 0.45 |
upper |
Upper bound for allele frequency region, default is 0.55 |
Value
VCF file without CNV region
VCF example file.
Description
An example containing a list of 4 data frames.
Usage
vcf_example
Format
A list of 4 data frames:
Source
Created by Tao Jiang