Version: | 0.2.4 |
Date: | 2020-10-22 |
Title: | Identifying Functional Polymorphisms |
Author: | Park L |
Maintainer: | Leeyoung Park <lypark@yonsei.ac.kr> |
Description: | A suite for identifying causal models using relative concordances and for identifying causal polymorphisms in case-control genetic association data, especially with re-sequenced data from large control samples. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Depends: | R (≥ 2.11.1) |
Imports: | haplo.stats,coda |
URL: | https://www.r-project.org |
NeedsCompilation: | yes |
Repository: | CRAN |
Packaged: | 2020-11-25 13:35:07 UTC; L |
Date/Publication: | 2020-11-26 00:50:05 UTC |
Allele Frequency Computation from Genotype Data
Description
Computes allele frequencies from genotype data.
Usage
allele.freq(geno)
Arguments
geno |
matrix of alleles, such that each locus has a pair of adjacent columns of alleles, and the order of columns corresponds to the order of loci on a chromosome. If there are K loci, then ncol(geno) = 2*K. Rows represent the alleles for each subject. Each allele should be represented as a number (A=1,C=2,G=3,T=4). |
Value
array of allele frequencies, one per SNP. The allele whose frequency is reported is chosen following the order of alleles "A", "C", "G", and "T".
Examples
data(apoe)
allele.freq(apoe7)
allele.freq(apoe)
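As a rough illustration of the input layout described in Arguments, the base-R sketch below computes per-locus allele frequencies from a small made-up matrix with two adjacent columns per locus. The helper name `allele_freq_sketch` and the toy data are assumptions for illustration only. It reports the frequency of the first allele present in A, C, G, T order at each locus, which is one reading of the Value text and may differ from what `allele.freq` actually returns.

```r
# Hypothetical helper, not part of the package: frequency of the
# lowest-coded allele (A=1 < C=2 < G=3 < T=4) observed at each locus.
allele_freq_sketch <- function(geno) {
  stopifnot(is.matrix(geno), ncol(geno) %% 2 == 0)
  n_loci <- ncol(geno) / 2
  sapply(seq_len(n_loci), function(k) {
    alleles <- as.vector(geno[, c(2 * k - 1, 2 * k)])  # both columns of locus k
    alleles <- alleles[!is.na(alleles)]
    mean(alleles == min(alleles))
  })
}

# Toy data: 4 subjects, 2 loci (locus 1 is A/G, locus 2 is C/T).
toy <- matrix(c(1, 3, 2, 2,
                1, 1, 2, 4,
                3, 3, 2, 2,
                1, 3, 2, 4), ncol = 4, byrow = TRUE)
toy_freq <- allele_freq_sketch(toy)
print(toy_freq)
```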
Allele Frequency Computation from VCF-Type Sequencing Data of the 1000 Genomes Project
Description
Computes allele frequencies from sequencing data in the VCF-type format of the 1000 Genomes Project.
Usage
allele.freq.G(genoG)
Arguments
genoG |
matrix of haplotypes. Each row indicates a variant, and each column indicates a haplotype of an individual. Two alleles of 0 and 1 are available. |
Value
array of allele frequencies of each variant.
Examples
data(apoeG)
allele.freq.G(apoeG)
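For a 0/1 haplotype matrix of the shape described above (variants in rows, haplotypes in columns), a per-variant allele frequency can be sketched in base R as a row mean. The toy matrix is made up, and treating the result as the frequency of the allele coded 1 is an assumption; the source does not say which allele `allele.freq.G` reports.

```r
# Toy 0/1 haplotype matrix: 3 variants (rows) x 4 haplotypes (columns),
# i.e. 2 individuals with 2 haplotypes each. Values are made up.
hap <- matrix(c(0, 1, 0, 0,
                1, 1, 0, 1,
                0, 0, 0, 0), nrow = 3, byrow = TRUE)

# Assumed convention: report the frequency of the allele coded 1.
alt_freq <- rowMeans(hap)
print(alt_freq)
```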
Genetic data of APOE gene region
Description
This data set comes from re-sequenced data of the APOE gene region in the Molecular Diversity and Epidemiology of Common Disease (MDECODE) database. Sixteen polymorphic sites are included. The "apoe7" data contains the genetic data of the seven single nucleotide polymorphisms in the apoe data with allele frequencies higher than 0.1.
Usage
data(apoe)
Format
A matrix with 48 rows and 32 columns
Source
http://droog.gs.washington.edu/mdecode/
References
Nickerson, D. A., S. L. Taylor, S. M. Fullerton, K. M. Weiss, A. G. Clark et al. (2000) Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Genome Res 10: 1532-1545.
Sequencing data of APOE gene region from the 1000 Genomes Project
Description
This data set comes from re-sequenced data of the APOE gene region from the 1000 Genomes Project. Thirty-three polymorphic sites with allele frequencies higher than 0.001 are included in the original data set, apoeG. The test data sets, apoeT and apoeC, contain the data of 100 controls and 100 cases, respectively, when the dominant variant is the 15th variant with an odds ratio of 3.
Usage
data(apoeG)
Format
A matrix with 33 rows and 2184 columns
Source
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/
References
Abecasis, G. R. et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073.
Causal Models with All Possible Causal Factors: G, G*G, G*E and E
Description
Provides concordance probabilities of relative pairs for a causal model with G, G*G, G*E and E components.
Usage
drgegggne(fdg,frg,fdgg,frgg,fdge,frge,eg,e)
Arguments
fdg |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G component |
frg |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G component |
fdgg |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G*G component |
frgg |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G*G component |
fdge |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G*E component |
frge |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G*E component |
eg |
the proportion of the population exposed, during their entire life, to the environmental cause of G*E that interacts with the genetic cause of G*E |
e |
a proportion of population who are exposed to environmental cause during their entire life |
Value
matrix of NN, ND, and DD probabilities of 9 relative pairs: 1:mzt,2:parent-offspring,3:dzt,4:sibling,5:2-direct(grandparent-grandchild),6:3rd(uncle-niece),7:3-direct(great-grandparent-great-grandchild),8:4th (cousin),9:4d(great-great-grandparent-great-great-grandchild)
See Also
drggn drgegne
Examples
### PLI=0.01.
ppt<-0.01
### for a model that lacks one or more causal factors,
### set the relevant parameters to zero.
pg<-0.002 # the proportion of G component in total populations
pgg<-0.002 # the proportion of G*G component in total populations
pge<-0.003 # the proportion of G*E component in total populations
e<-1-(1-ppt)/(1-pg)/(1-pgg)/(1-pge)
# the proportion of E component in total populations
fd<-0.001 # one dominant gene
tt<-3 # the number of recessive genes
temp<-sqrt(1-((1-pg)/(1-fd)^2)^(1/tt))
fr<-c(array(0,length(fd)),array(temp,tt))
fd<-c(fd,array(0,tt))
ppd<-sqrt(pgg)
fdg<-array(1-sqrt(1-ppd^(1/2)),2)
ttg<-1
temp<-(pgg/ppd)^(1/2/ttg)
frg<-c(array(0,length(fdg)),array(temp,ttg))
fdg<-c(fdg,array(0,ttg))
ppe<-0.5
ppg<-pge/ppe
fdge<-0.002
ttge<-2 # the number of recessive genes
temp<-sqrt(1-((1-ppg)/(1-fdge)^2)^(1/ttge))
frge<-c(array(0,length(fdge)),array(temp,ttge))
fdge<-c(fdge,array(0,ttge))
drgegggne(fd,fr,fdg,frg,fdge,frge,ppe,e)
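The expression for `e` in the example can be read as assuming that the four components act independently, so that the probability of escaping disease is the product of the four escape probabilities. That interpretation is an assumption drawn from the arithmetic, not a statement from the package. The base-R check below only verifies that the stated proportions recombine to the population lifetime incidence.

```r
ppt <- 0.01    # population lifetime incidence (PLI)
pg  <- 0.002   # proportion of the G component
pgg <- 0.002   # proportion of the G*G component
pge <- 0.003   # proportion of the G*E component

# Same expression as in the example above.
e <- 1 - (1 - ppt) / (1 - pg) / (1 - pgg) / (1 - pge)

# Assumed independence: escape all four components with the product of
# the individual escape probabilities, then take the complement.
ppt_check <- 1 - (1 - pg) * (1 - pgg) * (1 - pge) * (1 - e)
print(c(e = e, ppt_check = ppt_check))
```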
Causal Models with Three Possible Causal Factors: G, G*E and E
Description
Provides concordance probabilities of relative pairs for a causal model with G, G*E and E components.
Usage
drgegne(fdg,frg,fdge,frge,eg,e)
Arguments
fdg |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G component |
frg |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G component |
fdge |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G*E component |
frge |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G*E component |
eg |
the proportion of the population exposed, during their entire life, to the environmental cause of G*E that interacts with the genetic cause of G*E |
e |
a proportion of population who are exposed to environmental cause during their entire life |
Value
matrix of NN, ND, and DD probabilities of 9 relative pairs: 1:mzt,2:parent-offspring,3:dzt,4:sibling,5:2-direct(grandparent-grandchild),6:3rd(uncle-niece),7:3-direct(great-grandparent-great-grandchild),8:4th (cousin),9:4d(great-great-grandparent-great-great-grandchild)
See Also
drgn drgene
Examples
### PLI=0.01.
ppt<-0.01
pg<-0.002 # the proportion of G component in total populations
pge<-0.005 # the proportion of G*E component in total populations
e<-1-(1-ppt)/(1-pg)/(1-pge)
# the proportion of E component in total populations
fd<-0.001 # one dominant gene
tt<-2 # the number of recessive genes
temp<-sqrt(1-((1-pg)/(1-fd)^2)^(1/tt))
fr<-c(array(0,length(fd)),array(temp,tt))
fd<-c(fd,array(0,tt))
ppe<-0.5
ppg<-pge/ppe
fdge<-0.002
ttge<-2 # the number of recessive genes
temp<-sqrt(1-((1-ppg)/(1-fdge)^2)^(1/ttge))
frge<-c(array(0,length(fdge)),array(temp,ttge))
fdge<-c(fdge,array(0,ttge))
drgegne(fd,fr,fdge,frge,ppe,e)
Causal Models with G*E
Description
Provides concordance probabilities of relative pairs for a causal model with a G*E component.
Usage
drgen(fd,fr,e)
Arguments
fd |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G component of G*E interacting with E of G*E |
fr |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G component of G*E interacting with E of G*E |
e |
a proportion of population who are exposed to environmental cause of G*E interacting with genetic cause of G*E during their entire life |
Value
a list of the g*e proportion in population and a matrix of NN, ND, and DD probabilities of 9 relative pairs: 1:mzt,2:parent-offspring,3:dzt,4:sibling,5:2-direct(grandparent-grandchild),6:3rd(uncle-niece),7:3-direct(great-grandparent-great-grandchild),8:4th (cousin),9:4d(great-great-grandparent-great-great-grandchild)
See Also
drgene.gm
Examples
### PLI=0.01.
ppt<-0.01
### g*e model
pge<-ppt # the proportion of G*E component in total populations
ppe<-0.5
ppg<-pge/ppe
fd<-0.0005 # one dominant gene
tt<-3 # the number of recessive genes
temp<-sqrt(1-((1-ppg)/(1-fd)^2)^(1/tt))
fr<-c(array(0,length(fd)),array(temp,tt))
fd<-c(fd,array(0,tt))
drgen(fd,fr,ppe)
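In the example, the G*E proportion is split as `pge = ppe * ppg`, and the gene frequencies are then chosen so that the genetic part alone has proportion `ppg`. The check below reproduces that arithmetic in base R. Reading `1 - (1 - fd)^2` as the carrier probability of a dominant allele and `q^2` as the homozygote probability of a recessive allele (Hardy-Weinberg proportions, independent genes) is an assumption inferred from the formula, not something the source states.

```r
pge <- 0.01          # proportion of the G*E component
ppe <- 0.5           # exposed proportion of the population
ppg <- pge / ppe     # required proportion of the genetic part

fd <- 0.0005         # frequency of the single dominant gene
tt <- 3              # number of recessive genes
q  <- sqrt(1 - ((1 - ppg) / (1 - fd)^2)^(1 / tt))  # same expression as `temp`

# Assumed reading: unaffected by the genetic part means no dominant
# allele carried and not homozygous at any of the tt recessive genes.
ppg_check <- 1 - (1 - fd)^2 * (1 - q^2)^tt
pge_check <- ppe * ppg_check
print(c(q = q, ppg_check = ppg_check, pge_check = pge_check))
```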
Causal Models with G*E and E
Description
Provides concordance probabilities of relative pairs for a causal model with G*E and E components.
Usage
drgene(fdg,frg,eg,e)
Arguments
fdg |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G component of G*E interacting with E of G*E |
frg |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G component of G*E interacting with E of G*E |
eg |
a proportion of population who are exposed to environmental cause of G*E interacting with genetic cause of G*E during their entire life |
e |
a proportion of population who are exposed to environmental cause during their entire life |
Value
matrix of NN, ND, and DD probabilities of 9 relative pairs: 1:mzt,2:parent-offspring,3:dzt,4:sibling,5:2-direct(grandparent-grandchild),6:3rd(uncle-niece),7:3-direct(great-grandparent-great-grandchild),8:4th (cousin),9:4d(great-great-grandparent-great-great-grandchild)
See Also
drgen.gm
Examples
### PLI=0.01.
ppt<-0.01
### g*e+e model
pge<-0.007 # the proportion of G*E component in total populations
e<-1-(1-ppt)/(1-pge) # the proportion of E component in total populations
ppe<-0.5
ppg<-pge/ppe
fd<-0.0005 # one dominant gene
tt<-3 # the number of recessive genes
temp<-sqrt(1-((1-ppg)/(1-fd)^2)^(1/tt))
fr<-c(array(0,length(fd)),array(temp,tt))
fd<-c(fd,array(0,tt))
drgene(fd,fr,ppe,e)
Causal Models with G*G
Description
Provides concordance probabilities of relative pairs for a causal model with a G*G component.
Usage
drggn(fd,fr)
Arguments
fd |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G*G component |
fr |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G*G component |
Value
a list of PLI and a matrix of NN, ND, and DD probabilities of 9 relative pairs: 1:mzt,2:parent-offspring,3:dzt,4:sibling,5:2-direct(grandparent-grandchild),6:3rd(uncle-niece),7:3-direct(great-grandparent-great-grandchild),8:4th (cousin),9:4d(great-great-grandparent-great-great-grandchild)
See Also
drgegggne
Examples
### PLI=0.01.
ppt<-0.01
### g*g model
pp<-ppt # the proportion of G*G component in total populations
gd<-sqrt(pp) # dominant gene proportion = recessive gene proportion
fd<-array(1-sqrt(1-gd^(1/2)),2) # two dominant genes
tt<-2 # the number of recessive genes: 2
temp<-(pp/gd)^(1/2/tt)
fr<-c(array(0,length(fd)),array(temp,tt))
fd<-c(fd,array(0,tt))
drggn(fd,fr)
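The arithmetic in the G*G example appears to split the target proportion `pp` evenly (on the square-root scale) between a dominant part and a recessive part, with the disease requiring all interacting genes at once. That reading is an assumption inferred from the formulas. The base-R check below confirms only that the per-gene frequencies multiply back to `pp` under Hardy-Weinberg proportions and independent genes.

```r
pp <- 0.01                        # proportion of the G*G component
gd <- sqrt(pp)                    # dominant part = recessive part = sqrt(pp)

fd_one <- 1 - sqrt(1 - gd^(1/2))  # allele frequency of each of 2 dominant genes
tt     <- 2                       # number of recessive genes
fr_one <- (pp / gd)^(1 / 2 / tt)  # allele frequency of each recessive gene

# Assumed reading: carrier at both dominant genes AND homozygous at all
# tt recessive genes.
p_all_dominant  <- (1 - (1 - fd_one)^2)^2
p_all_recessive <- (fr_one^2)^tt
pp_check <- p_all_dominant * p_all_recessive
print(c(p_all_dominant = p_all_dominant,
        p_all_recessive = p_all_recessive,
        pp_check = pp_check))
```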
Causal Models with G
Description
Provides concordance probabilities of relative pairs for a causal model with a G component.
Usage
drgn(fd,fr)
Arguments
fd |
an array (size=number of dominant genes+recessive genes) of dominant gene frequencies including 0 values of recessive genes of G component |
fr |
an array (size=number of dominant genes+recessive genes) of recessive gene frequencies including 0 values of dominant genes of G component |
Value
list of the value of PLI and the matrix of NN, ND, and DD probabilities of 9 relative pairs: 1:mzt,2:parent-offspring,3:dzt,4:sibling,5:2-direct(grandparent-grandchild),6:3rd(uncle-niece),7:3-direct(great-grandparent-great-grandchild),8:4th (cousin),9:4d(great-great-grandparent-great-great-grandchild)
See Also
drgegne.gm
Examples
### PLI=0.01.
ppt<-0.01
### g model
pp<-ppt # the proportion of G component in total populations
fdt<-0.001 # one dominant gene with frequency of 0.001
tt<-5 # the number of recessive genes: 5
fd<-c(fdt,array(0,tt))
temp<-sqrt(1-((1-pp)/(1-fdt)^2)^(1/tt))
fr<-c(0,array(temp,tt))
drgn(fd,fr)
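The `temp` expression in the G-model example solves for a common recessive allele frequency so that one dominant gene plus `tt` recessive genes together account for the proportion `pp`. The sketch below rebuilds `fd` and `fr` and checks that identity in base R. It assumes Hardy-Weinberg proportions and independent genes, which is an interpretation of the formula rather than something the source states.

```r
pp  <- 0.01    # proportion of the G component
fdt <- 0.001   # frequency of the single dominant gene
tt  <- 5       # number of recessive genes

q  <- sqrt(1 - ((1 - pp) / (1 - fdt)^2)^(1 / tt))  # same expression as `temp`
fd <- c(fdt, rep(0, tt))   # dominant frequencies, 0 in recessive positions
fr <- c(0, rep(q, tt))     # recessive frequencies, 0 in dominant position

# Assumed reading: affected if carrying a dominant allele or being
# homozygous at any recessive gene.
p_unaffected <- prod((1 - fd)^2) * prod(1 - fr^2)
pp_check <- 1 - p_unaffected
print(c(q = q, pp_check = pp_check))
```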
Error Rates Estimation for Likelihood Ratio Tests Designed for Identifying Number of Functional Polymorphisms
Description
Compute error rates for a given model.
Usage
error.rates(H0,Z, pMc, geno, no.ca, no.con=nrow(geno), sim.no = 1000)
Arguments
H0 |
the index number for a given model for functional SNPs |
Z |
number of functional SNPs for the given model |
pMc |
array of allele frequencies of case samples |
geno |
matrix of alleles, such that each locus has a pair of adjacent columns of alleles, and the order of columns corresponds to the order of loci on a chromosome. If there are K loci, then ncol(geno) = 2*K. Rows represent the alleles for each subject. Each allele should be represented as a number (A=1,C=2,G=3,T=4). |
no.ca |
number of case chromosomes |
no.con |
number of control chromosomes |
sim.no |
number of simulations for error rates estimation |
Value
array of results consisting of the Type I error rate (alpha=0.05), the Type I error rate (alpha=0.01), the Type II error rate (beta=0.05), the Type II error rate (beta=0.01), and the percentage of simulations in which the target model has the lowest corrected -2 log likelihood ratio.
See Also
allele.freq hap.freq lrtB
Examples
## LRT tests when SNP1 & SNP6 are the functional polymorphisms.
data(apoe)
n<-c(2000, 2000, 2000, 2000, 2000, 2000, 2000) #case sample size = 1000
x<-c(1707, 281,1341, 435, 772, 416, 1797) #allele numbers in case samples
Z<-2 #number of functional SNPs for tests
n.poly<-ncol(apoe7)/2 #total number of SNPs
#index number for the model in this case is 5 for SNP1 and SNP6.
#apoe7 is considered to represent the true control allele and haplotype frequencies.
#Control sample size = 1000.
error.rates(5, 2, x/n, apoe7, 2000, 2000, sim.no=2)
# to obtain valid rates, use sim.no=1000.
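The comment in the example gives index 5 for the model with SNP1 and SNP6 among seven SNPs. That matches the position of the pair (1, 6) when all two-SNP sets are listed in the order produced by base R's `combn`. Whether the package enumerates models this way internally is an assumption; the sketch only shows how such an index could be looked up.

```r
n_poly <- 7   # total number of SNPs (as for apoe7)
Z      <- 2   # number of functional SNPs in the model

# One column per candidate set of Z SNPs, in lexicographic order.
models <- combn(n_poly, Z)

# Locate the column holding SNP1 and SNP6.
target <- c(1, 6)
idx <- which(apply(models, 2, function(m) all(m == target)))
print(idx)
```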
Genotype Frequency Computation from VCF-Type Sequencing Data of the 1000 Genomes Project
Description
Computes genotype frequencies from sequencing data in the VCF-type format of the 1000 Genomes Project.
Usage
geno.freq(genoG)
Arguments
genoG |
matrix of haplotypes. Each row indicates a variant, and each column indicates a haplotype of an individual. Two alleles of 0 and 1 are available. |
Value
matrix of genotype frequencies of each variant.
Examples
data(apoeG)
geno.freq(apoeG)
Conversion to Genotypes from Alleles Using VCF-Type Sequencing Data of the 1000 Genomes Project
Description
Converts sequencing data to genotypes.
Usage
genotype(genoG)
Arguments
genoG |
matrix of haplotypes. Each row indicates a variant, and each column indicates a haplotype of an individual. Two alleles of 0 and 1 are available. |
Value
matrix of genotypes with rows of variants and with columns of individuals.
Examples
data(apoeG)
genotype(apoeG)
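The conversion from haplotypes to genotypes, and the genotype frequencies computed by the preceding function, can be illustrated in base R on a toy matrix. The sketch assumes that adjacent columns (1-2, 3-4, ...) hold the two haplotypes of one individual and codes a genotype as the count of alleles coded 1. Both conventions are assumptions and may not match the output format of `genotype` or `geno.freq`.

```r
# Toy data: 2 variants (rows) x 8 haplotypes (columns) = 4 individuals.
hap <- matrix(c(0, 1, 0, 0, 1, 1, 0, 0,
                1, 1, 0, 1, 0, 1, 0, 0), nrow = 2, byrow = TRUE)

# Assumed layout: odd column = first haplotype, next column = second.
odd <- seq(1, ncol(hap), by = 2)
geno_count <- hap[, odd, drop = FALSE] + hap[, odd + 1, drop = FALSE]

# Share of individuals carrying 0, 1 or 2 copies, per variant.
geno_freq <- t(apply(geno_count, 1, function(g) {
  as.numeric(table(factor(g, levels = 0:2))) / length(g)
}))
colnames(geno_freq) <- c("0", "1", "2")
print(geno_count)
print(geno_freq)
```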
Estimation of Haplotype Frequencies with Two SNPs
Description
EM computation of haplotype frequencies with two SNPs. The computation relies on the package "haplo.stats".
Usage
hap.freq(geno)
Arguments
geno |
matrix of alleles, such that each locus has a pair of adjacent columns of alleles, and the order of columns corresponds to the order of loci on a chromosome. If there are K loci, then ncol(geno) = 2*K. Rows represent the alleles for each subject. Each allele should be represented as a number (A=1,C=2,G=3,T=4). |
Value
matrix of haplotype frequencies, where each haplotype consists of one allele from each of the two SNPs. These alleles are the same ones whose frequencies are computed by the function "allele.freq".
See Also
allele.freq
Examples
data(apoe)
hap.freq(apoe7)
hap.freq(apoe)
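To make the EM idea concrete, the following base-R sketch estimates two-SNP haplotype frequencies from unphased genotypes coded as allele counts (0, 1, 2). It is a textbook-style illustration under Hardy-Weinberg assumptions, not the package's implementation, which delegates to "haplo.stats". The function name `em_two_snp` and the toy genotypes are hypothetical.

```r
# Hypothetical helper: EM for haplotypes "00", "01", "10", "11"
# (allele at SNP 1, allele at SNP 2). Only double heterozygotes are
# ambiguous: they are either {00, 11} or {01, 10}.
em_two_snp <- function(g1, g2, tol = 1e-10, max_iter = 1000) {
  stopifnot(length(g1) == length(g2), all(g1 %in% 0:2), all(g2 %in% 0:2))
  hap_names <- c("00", "01", "10", "11")
  dh <- g1 == 1 & g2 == 1                      # double heterozygotes
  # Unambiguous individuals: both haplotypes are determined.
  a <- paste0(floor(g1[!dh] / 2), floor(g2[!dh] / 2))
  b <- paste0(ceiling(g1[!dh] / 2), ceiling(g2[!dh] / 2))
  base <- as.numeric(table(factor(c(a, b), levels = hap_names)))
  names(base) <- hap_names
  n_dh <- sum(dh)
  p <- setNames(rep(0.25, 4), hap_names)       # starting values
  for (iter in seq_len(max_iter)) {
    # E-step: expected share of double heterozygotes that are {00, 11}.
    cis   <- unname(p["00"] * p["11"])
    trans <- unname(p["01"] * p["10"])
    w <- if (cis + trans > 0) cis / (cis + trans) else 0.5
    # M-step: update haplotype frequencies from expected counts.
    counts <- base
    counts[c("00", "11")] <- counts[c("00", "11")] + n_dh * w
    counts[c("01", "10")] <- counts[c("01", "10")] + n_dh * (1 - w)
    p_new <- counts / sum(counts)
    converged <- max(abs(p_new - p)) < tol
    p <- p_new
    if (converged) break
  }
  p
}

# Toy genotypes: four homozygotes and two double heterozygotes.
g1 <- c(0, 2, 1, 1, 0, 2)
g2 <- c(0, 2, 1, 1, 0, 2)
em_freq <- em_two_snp(g1, g2)
print(round(em_freq, 4))
```

In this toy data all unambiguous haplotypes are "00" or "11", so the EM iterations push the double heterozygotes toward the same phase.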
MCMC Inference of Causal Models with All Possible Causal Factors: G, G*G, G*E and E
Description
Provides proportions of each causal factor of G, G*G, G*E and E based on relative concordance data.
Usage
iter.mcmc(ppt,aj=2,n.iter,n.chains,thinning=5,init.cut,darray,x,n,model,mcmcrg=0.01)
Arguments
ppt |
population lifetime incidence |
aj |
a constant for the stage of data collection |
n.iter |
number of mcmc iterations |
n.chains |
number of mcmc chains |
thinning |
mcmc thinning parameter (default=5) |
init.cut |
mcmc data cut |
darray |
indicating the array positions of available data among 9 relative pairs: 1:mzt,2:parent-offspring,3:dzt,4:sibling,5:2-direct(grandparent-grandchild),6:3rd(uncle-niece),7:3-direct(great-grandparent-great-grandchild),8:4th (cousin),9:4d(great-great-grandparent-great-great-grandchild) |
x |
number of disease concordance of relative pairs |
n |
total number of relative pairs |
model |
an array of size 4 (1: E component; 2: G component; 3: G*E component; 4: G*G component), indicating the existence of each causal component: 0: excluded; 1: included. |
mcmcrg |
parameter of the data collection stage (default=0.01) |
Value
a list of rejectionRate, result summary, and Gelman-Rubin diagnostics (point est. & upper C.I.) for the output variables:
e[1]: proportion of environmental factor (E)
g[2]: proportion of genetic factor (G)
ge[3]: proportion of gene-environment interaction (G*E)
gg[4]: proportion of gene interactions (G*G)
gn[5]: number of recessive genes in G
ppe[6]: population proportion of interacting environment in G*E
ppg[7]: population proportion of interacting genetic factor in G*E
fd[8]: frequency of dominant genes in G
fdge[9]: frequency of dominant genes in G*E
gnge[10]: number of recessive genes in G*E
ppd[11]: population proportion of dominant genes in G*G
ppr[12]: population proportion of recessive genes in G*G
kd[13]: number of dominant genes in G*G
kr[14]: number of recessive genes in G*G
References
L. Park, J. Kim, A novel approach for identifying causal models of complex disease from family data, Genetics, 2015 Apr; 199, 1007-1016.
Examples
### PLI=0.01.
ppt<-0.01
### a simple causal model with G and E components
pg<-0.007 # the proportion of G component in total populations
pgg<-0 # the proportion of G*G component in total populations
pge<-0 # the proportion of G*E component in total populations
e<-1-(1-ppt)/(1-pg) # the proportion of E component in total populations
fd<-0.001 # one dominant gene
tt<-3 # the number of recessive genes
temp<-sqrt(1-((1-pg)/(1-fd)^2)^(1/tt))
fr<-c(array(0,length(fd)),array(temp,tt))
fd<-c(fd,array(0,tt))
rp<-drgegggne(fd,fr,c(0,0),c(0,0),c(0,0),c(0,0),0,e)
sdata<-rp[,3]/(rp[,2]+rp[,3])
#sdata<-round(sdata*500)
darray<-c(1:2,4:6)
## available data= MZT, P-O, sibs, grandparent-grandchild, avuncular pair
n<-array(1000,length(darray))
x<-array()
for(i in 1:length(darray)){
x[i]<-rbinom(1,n[i],sdata[darray[i]])
}
model<-c(1,1,0,0)
## remove # from the following lines to test examples.
#iter.mcmc(ppt,2,15,2,1,1,darray,x,n,model) # provide a running test
#iter.mcmc(ppt,2,2000,2,10,500,darray,x,n,model) # provide a proper result
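The data-preparation steps of the example (choosing the available relative pairs, simulating concordance counts, and flagging the model components) can be tried without the package. In the base-R sketch below, the concordance probabilities in `sdata` are made-up placeholders standing in for values derived from `drgegggne`, and the names attached to `model` are added only for readability; neither comes from the source.

```r
set.seed(1)

pair_names <- c("mzt", "parent-offspring", "dzt", "sibling", "2-direct",
                "3rd", "3-direct", "4th", "4d")
# Made-up concordance probabilities for the 9 relative pairs.
sdata <- setNames(c(0.40, 0.08, 0.09, 0.09, 0.04, 0.04, 0.02, 0.02, 0.015),
                  pair_names)

darray <- c(1:2, 4:6)            # MZT, P-O, sibs, grandparent, avuncular
n <- rep(1000, length(darray))   # relative pairs observed per type

# Vectorised equivalent of the rbinom loop in the example.
x <- rbinom(length(darray), size = n, prob = sdata[darray])

# Component flags in the documented order: E, G, G*E, G*G.
model <- c(E = 1, G = 1, GxE = 0, GxG = 0)
print(data.frame(pair = pair_names[darray], n = n, x = x))
```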
Likelihood Ratio Tests for Identifying Number of Functional Polymorphisms
Description
Compute p-values and likelihoods of all possible models for a given number of functional SNP(s).
Usage
lrt(n.fp, n, x, geno, no.con=nrow(geno))
Arguments
n.fp |
number of functional SNPs for tests. |
n |
array of each total number of case sample chromosomes for SNPs |
x |
array of each total allele number in case samples |
geno |
matrix of alleles, such that each locus has a pair of adjacent columns of alleles, and the order of columns corresponds to the order of loci on a chromosome. If there are K loci, then ncol(geno) = 2*K. Rows represent the alleles for each subject. Each allele should be represented as a number (A=1,C=2,G=3,T=4). |
no.con |
number of control chromosomes. |
Value
matrix of likelihood ratio test results. The first n.fp rows indicate the model for each set of disease polymorphisms, followed by p-values, -2 log(likelihood ratio) with corrections for variances, maximum likelihood ratio estimates, and likelihood.
References
L. Park, Identifying disease polymorphisms from case-control genetic association data, Genetica, 2010 138 (11-12), 1147-1159.
See Also
allele.freq hap.freq
Examples
## LRT tests when SNP1 & SNP6 are the functional polymorphisms.
data(apoe)
n<-c(2000, 2000, 2000, 2000, 2000, 2000, 2000) #case sample size = 1000
x<-c(1707, 281,1341, 435, 772, 416, 1797) #allele numbers in case samples
Z<-2 #number of functional SNPs for tests
n.poly<-ncol(apoe7)/2 #total number of SNPs
#control sample generation( sample size = 1000 )
con.samp<-sample(nrow(apoe7),1000,replace=TRUE)
con.data<-apoe7[con.samp,] # rows of apoe7 resampled with replacement
lrt(1,n,x,con.data)
lrt(2,n,x,con.data)
Likelihood Ratio Tests for Identifying Disease Polymorphisms with Same Effects
Description
Compute p-values and likelihoods of all possible models for a given number of disease SNP(s).
Usage
lrtG(n.fp, genoT, genoC)
Arguments
n.fp |
number of disease SNPs for tests. |
genoT |
matrix of control genotypes. Each row indicates a variant, and each column indicates a haplotype of an individual. Two alleles of 0 and 1 are allowed. |
genoC |
matrix of case genotypes. Each row indicates a variant, and each column indicates a haplotype of an individual. Two alleles of 0 and 1 are allowed. |
Value
matrix of likelihood ratio test results. The first row indicates the index, the following n.fp rows indicate the model for each set of disease polymorphisms, and these are followed by p-values, -2 log(likelihood ratio) with corrections for variances, and the degrees of freedom.
References
L. Park, J. Kim, Rare high-impact disease variants: properties and identification, Genetics Research, 2016 Mar; 98, e6.
See Also
allele.freq.G
Examples
## LRT tests for a dominant variant (15th variant)
## the odds ratio: 3, control: 100, case: 100.
data(apoeG)
lrtG(1,genoT[,1:20],genoC[,1:20])
# use "lrtG(1,genoT,genoC)" for the actual test.
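The result matrix reports p-values alongside -2 log(likelihood ratio) values and degrees of freedom. Under the usual asymptotic chi-square approximation, such a statistic maps to a p-value as sketched below in base R. This is a generic illustration with made-up statistics and an assumed single degree of freedom; it does not reproduce the variance corrections the package applies.

```r
stat <- c(0.5, 3.841459, 10.83)  # made-up -2 log(likelihood ratio) values
df   <- 1                        # assumed degrees of freedom

# Upper-tail chi-square probability for each statistic.
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
print(signif(p_value, 3))
```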