Type: | Package |
Title: | N-Gram Analysis of Biological Sequences |
Version: | 1.6.3 |
LazyData: | true |
Date: | 2020-03-31 |
Description: | Tools for extraction and analysis of various n-grams (k-mers) derived from biological sequences (proteins or nucleic acids). Contains QuiPT (quick permutation test) for fast feature-filtering of the n-gram data. |
License: | GPL-3 |
URL: | https://github.com/michbur/biogram |
BugReports: | https://github.com/michbur/biogram/issues |
VignetteBuilder: | knitr |
Depends: | R (≥ 3.0.0), slam |
Imports: | combinat, entropy, partitions |
Suggests: | ggplot2, knitr, testthat |
NeedsCompilation: | no |
Repository: | CRAN |
Encoding: | UTF-8 |
RoxygenNote: | 7.1.0 |
Packaged: | 2020-03-31 13:55:56 UTC; michal |
Author: | Michal Burdukiewicz
|
Maintainer: | Michal Burdukiewicz <michalburdukiewicz@gmail.com> |
Date/Publication: | 2020-03-31 14:30:06 UTC |
biogram - analysis of biological sequences using n-grams
Description
biogram
package is a toolbox for the analysis of
nucleic acid and protein sequences using n-grams. Possible applications include
motif discovery, feature selection, clustering, and classification.
n-grams
n-grams (k-tuples) are sets of n
characters derived from the input sequence(s).
They may form continuous sub-sequences or be discontinuous. For example, from the
sequence of nucleotides AATA
one can extract the following continuous
2-grams (bigrams): AA
, AT
and TA
. Moreover, there are two
possible bigrams separated by a single space: A_T
and A_A
, and one
bigram separated by two spaces: A__A
.
Another important n-gram parameter is its position. Instead of just counting n-grams,
one may want to count how many n-grams occur at a given position in multiple (e.g. related)
sequences. For example, in the sequences AATA
and AACA
there is only one
bigram at position 1: AA
, but there are two bigrams at position two: AT
and
AC
. The following notation is used for position-specific n-grams: 1_AA
,
2_AT
, 2_AC
.
In the biogram
package, the count_ngrams
function is used for
counting and extracting n-grams. Using the d
argument the user can specify the
distance between elements of the n-grams. The pos
argument can be used to enable
position specificity.
n-gram data dimensionality
We note that n-grams suffer from the curse of dimensionality. For example, for a peptide
of length 6 20^{n}
n-grams and 6 \times 20^{n}
positioned n-grams are possible.
Data sets of such an enormous size are hard to manage and analyze in R.
The biogram
package deals with both of the abovementioned problems. It uses
innate properties of the n-gram data which usually can be represented by sparse
matrices. Data storage is done using functionalities from the slam
package. To ease
the selection of significant features, biogram
provides the user with QuiPT,
a very fast permutation test for binary data (see test_features
).
Another way of reducing dimensionality is the aggregation of sequence residues into more
general groups. For example, all positively-charged amino acids may be aggregated into
one group. This action can be performed using the degenerate
function.
Encoding of amino acids can easu sequence analysis, but multidimensional
objects as the aggregations of amino acids are not easily comparable. We introduced the
encoding distance, a measure defining the distance between encodings. It can be computed
using the calc_ed
function.
Author(s)
Michal Burdukiewicz, Piotr Sobczyk, Chris Lauber
Examples
# use data set from package
data(human_cleave)
# first nine columns represent subsequent nine amino acids from cleavage sites
# degenerate the sequence to reduce the dimensionality of the problem
# (use five groups instead of 20 amino acids)
deg_seqs <- degenerate(human_cleave[, 1L:9],
list(`a` = c(1, 6, 8, 10, 11, 18),
`b` = c(2, 13, 14, 16, 17),
`c` = c(5, 19, 20),
`d` = c(7, 9, 12, 15),
'e' = c(3, 4)))
# EXAMPLE 1 - extract significant trigrams
# extract trigrams
trigrams <- count_ngrams(deg_seqs, 3, letters[1L:5], pos = TRUE)
# select features that differ between the two target groups using QuiPT
test1 <- test_features(human_cleave[, "tar"], trigrams)
# see a summary of the results
summary(test1)
# aggregate features in groups based on their p-value
gr <- cut(test1)
# get position map of the most significant n-grams
position_ngrams(gr[[1]])
# transform the most significant n-grams to more readable form
decode_ngrams(gr[[1]])
# EXAMPLE 2 - search for specific n-grams
# the n-grams of the interest are a_a (a-gap-a) and e_e (e-gap-e) on the
# 3rd and 4th position
# firstly code n-grams in biogram notation and add position information
coded <- code_ngrams(c("a_a", "c_c"))
# add position information
coded <- c(paste0("3_", coded), paste0("4_", coded))
# count only the features of the interest
bigrams <- count_specified(deg_seqs, coded)
# test which of the features of the interest is significant
test2 <- test_features(human_cleave[, "tar"], bigrams)
cut(test2)
Normalized amino acids properties
Description
Normalized (0-1) 554 amino acid properties as retreived from AAIndex database (release 9.1) enriched with contactivity of amino acids.
Format
A data frames with 20 columns and 600 rows.
Details
Following properties are included (AAIndex key: description of the property)
- ANDN920101
alpha-CH chemical shifts (Andersen et al., 1992)
- ARGP820101
Hydrophobicity index (Argos et al., 1982)
- ARGP820102
Signal sequence helical potential (Argos et al., 1982)
- ARGP820103
Membrane-buried preference parameters (Argos et al., 1982)
- BEGF750101
Conformational parameter of inner helix (Beghin-Dirkx, 1975)
- BEGF750102
Conformational parameter of beta-structure (Beghin-Dirkx, 1975)
- BEGF750103
Conformational parameter of beta-turn (Beghin-Dirkx, 1975)
- BHAR880101
Average flexibility indices (Bhaskaran-Ponnuswamy, 1988)
- BIGC670101
Residue volume (Bigelow, 1967)
- BIOV880101
Information value for accessibility; average fraction 35% (Biou et al., 1988)
- BIOV880102
Information value for accessibility; average fraction 23% (Biou et al., 1988)
- BROC820101
Retention coefficient in TFA (Browne et al., 1982)
- BROC820102
Retention coefficient in HFBA (Browne et al., 1982)
- BULH740101
Transfer free energy to surface (Bull-Breese, 1974)
- BULH740102
Apparent partial specific volume (Bull-Breese, 1974)
- BUNA790101
alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
- BUNA790102
alpha-CH chemical shifts (Bundi-Wuthrich, 1979)
- BUNA790103
Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
- BURA740101
Normalized frequency of alpha-helix (Burgess et al., 1974)
- BURA740102
Normalized frequency of extended structure (Burgess et al., 1974)
- CHAM810101
Steric parameter (Charton, 1981)
- CHAM820101
Polarizability parameter (Charton-Charton, 1982)
- CHAM820102
Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
- CHAM830101
The Chou-Fasman parameter of the coil conformation (Charton-Charton, 1983)
- CHAM830102
A parameter defined from the residuals obtained from the best correlation of the Chou-Fasman parameter of beta-sheet (Charton-Charton, 1983)
- CHAM830103
The number of atoms in the side chain labelled 1+1 (Charton-Charton, 1983)
- CHAM830104
The number of atoms in the side chain labelled 2+1 (Charton-Charton, 1983)
- CHAM830105
The number of atoms in the side chain labelled 3+1 (Charton-Charton, 1983)
- CHAM830106
The number of bonds in the longest chain (Charton-Charton, 1983)
- CHAM830107
A parameter of charge transfer capability (Charton-Charton, 1983)
- CHAM830108
A parameter of charge transfer donor capability (Charton-Charton, 1983)
- CHOC750101
Average volume of buried residue (Chothia, 1975)
- CHOC760101
Residue accessible surface area in tripeptide (Chothia, 1976)
- CHOC760102
Residue accessible surface area in folded protein (Chothia, 1976)
- CHOC760103
Proportion of residues 95% buried (Chothia, 1976)
- CHOC760104
Proportion of residues 100% buried (Chothia, 1976)
- CHOP780101
Normalized frequency of beta-turn (Chou-Fasman, 1978a)
- CHOP780201
Normalized frequency of alpha-helix (Chou-Fasman, 1978b)
- CHOP780202
Normalized frequency of beta-sheet (Chou-Fasman, 1978b)
- CHOP780203
Normalized frequency of beta-turn (Chou-Fasman, 1978b)
- CHOP780204
Normalized frequency of N-terminal helix (Chou-Fasman, 1978b)
- CHOP780205
Normalized frequency of C-terminal helix (Chou-Fasman, 1978b)
- CHOP780206
Normalized frequency of N-terminal non helical region (Chou-Fasman, 1978b)
- CHOP780207
Normalized frequency of C-terminal non helical region (Chou-Fasman, 1978b)
- CHOP780208
Normalized frequency of N-terminal beta-sheet (Chou-Fasman, 1978b)
- CHOP780209
Normalized frequency of C-terminal beta-sheet (Chou-Fasman, 1978b)
- CHOP780210
Normalized frequency of N-terminal non beta region (Chou-Fasman, 1978b)
- CHOP780211
Normalized frequency of C-terminal non beta region (Chou-Fasman, 1978b)
- CHOP780212
Frequency of the 1st residue in turn (Chou-Fasman, 1978b)
- CHOP780213
Frequency of the 2nd residue in turn (Chou-Fasman, 1978b)
- CHOP780214
Frequency of the 3rd residue in turn (Chou-Fasman, 1978b)
- CHOP780215
Frequency of the 4th residue in turn (Chou-Fasman, 1978b)
- CHOP780216
Normalized frequency of the 2nd and 3rd residues in turn (Chou-Fasman, 1978b)
- CIDH920101
Normalized hydrophobicity scales for alpha-proteins (Cid et al., 1992)
- CIDH920102
Normalized hydrophobicity scales for beta-proteins (Cid et al., 1992)
- CIDH920103
Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992)
- CIDH920104
Normalized hydrophobicity scales for alpha/beta-proteins (Cid et al., 1992)
- CIDH920105
Normalized average hydrophobicity scales (Cid et al., 1992)
- COHE430101
Partial specific volume (Cohn-Edsall, 1943)
- CRAJ730101
Normalized frequency of middle helix (Crawford et al., 1973)
- CRAJ730102
Normalized frequency of beta-sheet (Crawford et al., 1973)
- CRAJ730103
Normalized frequency of turn (Crawford et al., 1973)
- DAWD720101
Size (Dawson, 1972)
- DAYM780101
Amino acid composition (Dayhoff et al., 1978a)
- DAYM780201
Relative mutability (Dayhoff et al., 1978b)
- DESM900101
Membrane preference for cytochrome b: MPH89 (Degli Esposti et al., 1990)
- DESM900102
Average membrane preference: AMP07 (Degli Esposti et al., 1990)
- EISD840101
Consensus normalized hydrophobicity scale (Eisenberg, 1984)
- EISD860101
Solvation free energy (Eisenberg-McLachlan, 1986)
- EISD860102
Atom-based hydrophobic moment (Eisenberg-McLachlan, 1986)
- EISD860103
Direction of hydrophobic moment (Eisenberg-McLachlan, 1986)
- FASG760101
Molecular weight (Fasman, 1976)
- FASG760102
Melting point (Fasman, 1976)
- FASG760103
Optical rotation (Fasman, 1976)
- FASG760104
pK-N (Fasman, 1976)
- FASG760105
pK-C (Fasman, 1976)
- FAUJ830101
Hydrophobic parameter pi (Fauchere-Pliska, 1983)
- FAUJ880101
Graph shape index (Fauchere et al., 1988)
- FAUJ880102
Smoothed upsilon steric parameter (Fauchere et al., 1988)
- FAUJ880103
Normalized van der Waals volume (Fauchere et al., 1988)
- FAUJ880104
STERIMOL length of the side chain (Fauchere et al., 1988)
- FAUJ880105
STERIMOL minimum width of the side chain (Fauchere et al., 1988)
- FAUJ880106
STERIMOL maximum width of the side chain (Fauchere et al., 1988)
- FAUJ880107
N.m.r. chemical shift of alpha-carbon (Fauchere et al., 1988)
- FAUJ880108
Localized electrical effect (Fauchere et al., 1988)
- FAUJ880109
Number of hydrogen bond donors (Fauchere et al., 1988)
- FAUJ880110
Number of full nonbonding orbitals (Fauchere et al., 1988)
- FAUJ880111
Positive charge (Fauchere et al., 1988)
- FAUJ880112
Negative charge (Fauchere et al., 1988)
- FAUJ880113
pK-a(RCOOH) (Fauchere et al., 1988)
- FINA770101
Helix-coil equilibrium constant (Finkelstein-Ptitsyn, 1977)
- FINA910101
Helix initiation parameter at posision i-1 (Finkelstein et al., 1991)
- FINA910102
Helix initiation parameter at posision i,i+1,i+2 (Finkelstein et al., 1991)
- FINA910103
Helix termination parameter at posision j-2,j-1,j (Finkelstein et al., 1991)
- FINA910104
Helix termination parameter at posision j+1 (Finkelstein et al., 1991)
- GARJ730101
Partition coefficient (Garel et al., 1973)
- GEIM800101
Alpha-helix indices (Geisow-Roberts, 1980)
- GEIM800102
Alpha-helix indices for alpha-proteins (Geisow-Roberts, 1980)
- GEIM800103
Alpha-helix indices for beta-proteins (Geisow-Roberts, 1980)
- GEIM800104
Alpha-helix indices for alpha/beta-proteins (Geisow-Roberts, 1980)
- GEIM800105
Beta-strand indices (Geisow-Roberts, 1980)
- GEIM800106
Beta-strand indices for beta-proteins (Geisow-Roberts, 1980)
- GEIM800107
Beta-strand indices for alpha/beta-proteins (Geisow-Roberts, 1980)
- GEIM800108
Aperiodic indices (Geisow-Roberts, 1980)
- GEIM800109
Aperiodic indices for alpha-proteins (Geisow-Roberts, 1980)
- GEIM800110
Aperiodic indices for beta-proteins (Geisow-Roberts, 1980)
- GEIM800111
Aperiodic indices for alpha/beta-proteins (Geisow-Roberts, 1980)
- GOLD730101
Hydrophobicity factor (Goldsack-Chalifoux, 1973)
- GOLD730102
Residue volume (Goldsack-Chalifoux, 1973)
- GRAR740101
Composition (Grantham, 1974)
- GRAR740102
Polarity (Grantham, 1974)
- GRAR740103
Volume (Grantham, 1974)
- GUYH850101
Partition energy (Guy, 1985)
- HOPA770101
Hydration number (Hopfinger, 1971), Cited by Charton-Charton (1982)
- HOPT810101
Hydrophilicity value (Hopp-Woods, 1981)
- HUTJ700101
Heat capacity (Hutchens, 1970)
- HUTJ700102
Absolute entropy (Hutchens, 1970)
- HUTJ700103
Entropy of formation (Hutchens, 1970)
- ISOY800101
Normalized relative frequency of alpha-helix (Isogai et al., 1980)
- ISOY800102
Normalized relative frequency of extended structure (Isogai et al., 1980)
- ISOY800103
Normalized relative frequency of bend (Isogai et al., 1980)
- ISOY800104
Normalized relative frequency of bend R (Isogai et al., 1980)
- ISOY800105
Normalized relative frequency of bend S (Isogai et al., 1980)
- ISOY800106
Normalized relative frequency of helix end (Isogai et al., 1980)
- ISOY800107
Normalized relative frequency of double bend (Isogai et al., 1980)
- ISOY800108
Normalized relative frequency of coil (Isogai et al., 1980)
- JANJ780101
Average accessible surface area (Janin et al., 1978)
- JANJ780102
Percentage of buried residues (Janin et al., 1978)
- JANJ780103
Percentage of exposed residues (Janin et al., 1978)
- JANJ790101
Ratio of buried and accessible molar fractions (Janin, 1979)
- JANJ790102
Transfer free energy (Janin, 1979)
- JOND750101
Hydrophobicity (Jones, 1975)
- JOND750102
pK (-COOH) (Jones, 1975)
- JOND920101
Relative frequency of occurrence (Jones et al., 1992)
- JOND920102
Relative mutability (Jones et al., 1992)
- JUKT750101
Amino acid distribution (Jukes et al., 1975)
- JUNJ780101
Sequence frequency (Jungck, 1978)
- KANM800101
Average relative probability of helix (Kanehisa-Tsong, 1980)
- KANM800102
Average relative probability of beta-sheet (Kanehisa-Tsong, 1980)
- KANM800103
Average relative probability of inner helix (Kanehisa-Tsong, 1980)
- KANM800104
Average relative probability of inner beta-sheet (Kanehisa-Tsong, 1980)
- KARP850101
Flexibility parameter for no rigid neighbors (Karplus-Schulz, 1985)
- KARP850102
Flexibility parameter for one rigid neighbor (Karplus-Schulz, 1985)
- KARP850103
Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
- KHAG800101
The Kerr-constant increments (Khanarian-Moore, 1980)
- KLEP840101
Net charge (Klein et al., 1984)
- KRIW710101
Side chain interaction parameter (Krigbaum-Rubin, 1971)
- KRIW790101
Side chain interaction parameter (Krigbaum-Komoriya, 1979)
- KRIW790102
Fraction of site occupied by water (Krigbaum-Komoriya, 1979)
- KRIW790103
Side chain volume (Krigbaum-Komoriya, 1979)
- KYTJ820101
Hydropathy index (Kyte-Doolittle, 1982)
- LAWE840101
Transfer free energy, CHP/water (Lawson et al., 1984)
- LEVM760101
Hydrophobic parameter (Levitt, 1976)
- LEVM760102
Distance between C-alpha and centroid of side chain (Levitt, 1976)
- LEVM760103
Side chain angle theta(AAR) (Levitt, 1976)
- LEVM760104
Side chain torsion angle phi(AAAR) (Levitt, 1976)
- LEVM760105
Radius of gyration of side chain (Levitt, 1976)
- LEVM760106
van der Waals parameter R0 (Levitt, 1976)
- LEVM760107
van der Waals parameter epsilon (Levitt, 1976)
- LEVM780101
Normalized frequency of alpha-helix, with weights (Levitt, 1978)
- LEVM780102
Normalized frequency of beta-sheet, with weights (Levitt, 1978)
- LEVM780103
Normalized frequency of reverse turn, with weights (Levitt, 1978)
- LEVM780104
Normalized frequency of alpha-helix, unweighted (Levitt, 1978)
- LEVM780105
Normalized frequency of beta-sheet, unweighted (Levitt, 1978)
- LEVM780106
Normalized frequency of reverse turn, unweighted (Levitt, 1978)
- LEWP710101
Frequency of occurrence in beta-bends (Lewis et al., 1971)
- LIFS790101
Conformational preference for all beta-strands (Lifson-Sander, 1979)
- LIFS790102
Conformational preference for parallel beta-strands (Lifson-Sander, 1979)
- LIFS790103
Conformational preference for antiparallel beta-strands (Lifson-Sander, 1979)
- MANP780101
Average surrounding hydrophobicity (Manavalan-Ponnuswamy, 1978)
- MAXF760101
Normalized frequency of alpha-helix (Maxfield-Scheraga, 1976)
- MAXF760102
Normalized frequency of extended structure (Maxfield-Scheraga, 1976)
- MAXF760103
Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
- MAXF760104
Normalized frequency of left-handed alpha-helix (Maxfield-Scheraga, 1976)
- MAXF760105
Normalized frequency of zeta L (Maxfield-Scheraga, 1976)
- MAXF760106
Normalized frequency of alpha region (Maxfield-Scheraga, 1976)
- MCMT640101
Refractivity (McMeekin et al., 1964), Cited by Jones (1975)
- MEEJ800101
Retention coefficient in HPLC, pH7.4 (Meek, 1980)
- MEEJ800102
Retention coefficient in HPLC, pH2.1 (Meek, 1980)
- MEEJ810101
Retention coefficient in NaClO4 (Meek-Rossetti, 1981)
- MEEJ810102
Retention coefficient in NaH2PO4 (Meek-Rossetti, 1981)
- MEIH800101
Average reduced distance for C-alpha (Meirovitch et al., 1980)
- MEIH800102
Average reduced distance for side chain (Meirovitch et al., 1980)
- MEIH800103
Average side chain orientation angle (Meirovitch et al., 1980)
- MIYS850101
Effective partition energy (Miyazawa-Jernigan, 1985)
- NAGK730101
Normalized frequency of alpha-helix (Nagano, 1973)
- NAGK730102
Normalized frequency of bata-structure (Nagano, 1973)
- NAGK730103
Normalized frequency of coil (Nagano, 1973)
- NAKH900101
AA composition of total proteins (Nakashima et al., 1990)
- NAKH900102
SD of AA composition of total proteins (Nakashima et al., 1990)
- NAKH900103
AA composition of mt-proteins (Nakashima et al., 1990)
- NAKH900104
Normalized composition of mt-proteins (Nakashima et al., 1990)
- NAKH900105
AA composition of mt-proteins from animal (Nakashima et al., 1990)
- NAKH900106
Normalized composition from animal (Nakashima et al., 1990)
- NAKH900107
AA composition of mt-proteins from fungi and plant (Nakashima et al., 1990)
- NAKH900108
Normalized composition from fungi and plant (Nakashima et al., 1990)
- NAKH900109
AA composition of membrane proteins (Nakashima et al., 1990)
- NAKH900110
Normalized composition of membrane proteins (Nakashima et al., 1990)
- NAKH900111
Transmembrane regions of non-mt-proteins (Nakashima et al., 1990)
- NAKH900112
Transmembrane regions of mt-proteins (Nakashima et al., 1990)
- NAKH900113
Ratio of average and computed composition (Nakashima et al., 1990)
- NAKH920101
AA composition of CYT of single-spanning proteins (Nakashima-Nishikawa, 1992)
- NAKH920102
AA composition of CYT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
- NAKH920103
AA composition of EXT of single-spanning proteins (Nakashima-Nishikawa, 1992)
- NAKH920104
AA composition of EXT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
- NAKH920105
AA composition of MEM of single-spanning proteins (Nakashima-Nishikawa, 1992)
- NAKH920106
AA composition of CYT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
- NAKH920107
AA composition of EXT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
- NAKH920108
AA composition of MEM of multi-spanning proteins (Nakashima-Nishikawa, 1992)
- NISK800101
8 A contact number (Nishikawa-Ooi, 1980)
- NISK860101
14 A contact number (Nishikawa-Ooi, 1986)
- NOZY710101
Transfer energy, organic solvent/water (Nozaki-Tanford, 1971)
- OOBM770101
Average non-bonded energy per atom (Oobatake-Ooi, 1977)
- OOBM770102
Short and medium range non-bonded energy per atom (Oobatake-Ooi, 1977)
- OOBM770103
Long range non-bonded energy per atom (Oobatake-Ooi, 1977)
- OOBM770104
Average non-bonded energy per residue (Oobatake-Ooi, 1977)
- OOBM770105
Short and medium range non-bonded energy per residue (Oobatake-Ooi, 1977)
- OOBM850101
Optimized beta-structure-coil equilibrium constant (Oobatake et al., 1985)
- OOBM850102
Optimized propensity to form reverse turn (Oobatake et al., 1985)
- OOBM850103
Optimized transfer energy parameter (Oobatake et al., 1985)
- OOBM850104
Optimized average non-bonded energy per atom (Oobatake et al., 1985)
- OOBM850105
Optimized side chain interaction parameter (Oobatake et al., 1985)
- PALJ810101
Normalized frequency of alpha-helix from LG (Palau et al., 1981)
- PALJ810102
Normalized frequency of alpha-helix from CF (Palau et al., 1981)
- PALJ810103
Normalized frequency of beta-sheet from LG (Palau et al., 1981)
- PALJ810104
Normalized frequency of beta-sheet from CF (Palau et al., 1981)
- PALJ810105
Normalized frequency of turn from LG (Palau et al., 1981)
- PALJ810106
Normalized frequency of turn from CF (Palau et al., 1981)
- PALJ810107
Normalized frequency of alpha-helix in all-alpha class (Palau et al., 1981)
- PALJ810108
Normalized frequency of alpha-helix in alpha+beta class (Palau et al., 1981)
- PALJ810109
Normalized frequency of alpha-helix in alpha/beta class (Palau et al., 1981)
- PALJ810110
Normalized frequency of beta-sheet in all-beta class (Palau et al., 1981)
- PALJ810111
Normalized frequency of beta-sheet in alpha+beta class (Palau et al., 1981)
- PALJ810112
Normalized frequency of beta-sheet in alpha/beta class (Palau et al., 1981)
- PALJ810113
Normalized frequency of turn in all-alpha class (Palau et al., 1981)
- PALJ810114
Normalized frequency of turn in all-beta class (Palau et al., 1981)
- PALJ810115
Normalized frequency of turn in alpha+beta class (Palau et al., 1981)
- PALJ810116
Normalized frequency of turn in alpha/beta class (Palau et al., 1981)
- PARJ860101
HPLC parameter (Parker et al., 1986)
- PLIV810101
Partition coefficient (Pliska et al., 1981)
- PONP800101
Surrounding hydrophobicity in folded form (Ponnuswamy et al., 1980)
- PONP800102
Average gain in surrounding hydrophobicity (Ponnuswamy et al., 1980)
- PONP800103
Average gain ratio in surrounding hydrophobicity (Ponnuswamy et al., 1980)
- PONP800104
Surrounding hydrophobicity in alpha-helix (Ponnuswamy et al., 1980)
- PONP800105
Surrounding hydrophobicity in beta-sheet (Ponnuswamy et al., 1980)
- PONP800106
Surrounding hydrophobicity in turn (Ponnuswamy et al., 1980)
- PONP800107
Accessibility reduction ratio (Ponnuswamy et al., 1980)
- PONP800108
Average number of surrounding residues (Ponnuswamy et al., 1980)
- PRAM820101
Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
- PRAM820102
Slope in regression analysis x 1.0E1 (Prabhakaran-Ponnuswamy, 1982)
- PRAM820103
Correlation coefficient in regression analysis (Prabhakaran-Ponnuswamy, 1982)
- PRAM900101
Hydrophobicity (Prabhakaran, 1990)
- PRAM900102
Relative frequency in alpha-helix (Prabhakaran, 1990)
- PRAM900103
Relative frequency in beta-sheet (Prabhakaran, 1990)
- PRAM900104
Relative frequency in reverse-turn (Prabhakaran, 1990)
- PTIO830101
Helix-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
- PTIO830102
Beta-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
- QIAN880101
Weights for alpha-helix at the window position of -6 (Qian-Sejnowski, 1988)
- QIAN880102
Weights for alpha-helix at the window position of -5 (Qian-Sejnowski, 1988)
- QIAN880103
Weights for alpha-helix at the window position of -4 (Qian-Sejnowski, 1988)
- QIAN880104
Weights for alpha-helix at the window position of -3 (Qian-Sejnowski, 1988)
- QIAN880105
Weights for alpha-helix at the window position of -2 (Qian-Sejnowski, 1988)
- QIAN880106
Weights for alpha-helix at the window position of -1 (Qian-Sejnowski, 1988)
- QIAN880107
Weights for alpha-helix at the window position of 0 (Qian-Sejnowski, 1988)
- QIAN880108
Weights for alpha-helix at the window position of 1 (Qian-Sejnowski, 1988)
- QIAN880109
Weights for alpha-helix at the window position of 2 (Qian-Sejnowski, 1988)
- QIAN880110
Weights for alpha-helix at the window position of 3 (Qian-Sejnowski, 1988)
- QIAN880111
Weights for alpha-helix at the window position of 4 (Qian-Sejnowski, 1988)
- QIAN880112
Weights for alpha-helix at the window position of 5 (Qian-Sejnowski, 1988)
- QIAN880113
Weights for alpha-helix at the window position of 6 (Qian-Sejnowski, 1988)
- QIAN880114
Weights for beta-sheet at the window position of -6 (Qian-Sejnowski, 1988)
- QIAN880115
Weights for beta-sheet at the window position of -5 (Qian-Sejnowski, 1988)
- QIAN880116
Weights for beta-sheet at the window position of -4 (Qian-Sejnowski, 1988)
- QIAN880117
Weights for beta-sheet at the window position of -3 (Qian-Sejnowski, 1988)
- QIAN880118
Weights for beta-sheet at the window position of -2 (Qian-Sejnowski, 1988)
- QIAN880119
Weights for beta-sheet at the window position of -1 (Qian-Sejnowski, 1988)
- QIAN880120
Weights for beta-sheet at the window position of 0 (Qian-Sejnowski, 1988)
- QIAN880121
Weights for beta-sheet at the window position of 1 (Qian-Sejnowski, 1988)
- QIAN880122
Weights for beta-sheet at the window position of 2 (Qian-Sejnowski, 1988)
- QIAN880123
Weights for beta-sheet at the window position of 3 (Qian-Sejnowski, 1988)
- QIAN880124
Weights for beta-sheet at the window position of 4 (Qian-Sejnowski, 1988)
- QIAN880125
Weights for beta-sheet at the window position of 5 (Qian-Sejnowski, 1988)
- QIAN880126
Weights for beta-sheet at the window position of 6 (Qian-Sejnowski, 1988)
- QIAN880127
Weights for coil at the window position of -6 (Qian-Sejnowski, 1988)
- QIAN880128
Weights for coil at the window position of -5 (Qian-Sejnowski, 1988)
- QIAN880129
Weights for coil at the window position of -4 (Qian-Sejnowski, 1988)
- QIAN880130
Weights for coil at the window position of -3 (Qian-Sejnowski, 1988)
- QIAN880131
Weights for coil at the window position of -2 (Qian-Sejnowski, 1988)
- QIAN880132
Weights for coil at the window position of -1 (Qian-Sejnowski, 1988)
- QIAN880133
Weights for coil at the window position of 0 (Qian-Sejnowski, 1988)
- QIAN880134
Weights for coil at the window position of 1 (Qian-Sejnowski, 1988)
- QIAN880135
Weights for coil at the window position of 2 (Qian-Sejnowski, 1988)
- QIAN880136
Weights for coil at the window position of 3 (Qian-Sejnowski, 1988)
- QIAN880137
Weights for coil at the window position of 4 (Qian-Sejnowski, 1988)
- QIAN880138
Weights for coil at the window position of 5 (Qian-Sejnowski, 1988)
- QIAN880139
Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
- RACS770101
Average reduced distance for C-alpha (Rackovsky-Scheraga, 1977)
- RACS770102
Average reduced distance for side chain (Rackovsky-Scheraga, 1977)
- RACS770103
Side chain orientational preference (Rackovsky-Scheraga, 1977)
- RACS820101
Average relative fractional occurrence in A0(i) (Rackovsky-Scheraga, 1982)
- RACS820102
Average relative fractional occurrence in AR(i) (Rackovsky-Scheraga, 1982)
- RACS820103
Average relative fractional occurrence in AL(i) (Rackovsky-Scheraga, 1982)
- RACS820104
Average relative fractional occurrence in EL(i) (Rackovsky-Scheraga, 1982)
- RACS820105
Average relative fractional occurrence in E0(i) (Rackovsky-Scheraga, 1982)
- RACS820106
Average relative fractional occurrence in ER(i) (Rackovsky-Scheraga, 1982)
- RACS820107
Average relative fractional occurrence in A0(i-1) (Rackovsky-Scheraga, 1982)
- RACS820108
Average relative fractional occurrence in AR(i-1) (Rackovsky-Scheraga, 1982)
- RACS820109
Average relative fractional occurrence in AL(i-1) (Rackovsky-Scheraga, 1982)
- RACS820110
Average relative fractional occurrence in EL(i-1) (Rackovsky-Scheraga, 1982)
- RACS820111
Average relative fractional occurrence in E0(i-1) (Rackovsky-Scheraga, 1982)
- RACS820112
Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
- RACS820113
Value of theta(i) (Rackovsky-Scheraga, 1982)
- RACS820114
Value of theta(i-1) (Rackovsky-Scheraga, 1982)
- RADA880101
Transfer free energy from chx to wat (Radzicka-Wolfenden, 1988)
- RADA880102
Transfer free energy from oct to wat (Radzicka-Wolfenden, 1988)
- RADA880103
Transfer free energy from vap to chx (Radzicka-Wolfenden, 1988)
- RADA880104
Transfer free energy from chx to oct (Radzicka-Wolfenden, 1988)
- RADA880105
Transfer free energy from vap to oct (Radzicka-Wolfenden, 1988)
- RADA880106
Accessible surface area (Radzicka-Wolfenden, 1988)
- RADA880107
Energy transfer from out to in(95%buried) (Radzicka-Wolfenden, 1988)
- RADA880108
Mean polarity (Radzicka-Wolfenden, 1988)
- RICJ880101
Relative preference value at N" (Richardson-Richardson, 1988)
- RICJ880102
Relative preference value at N' (Richardson-Richardson, 1988)
- RICJ880103
Relative preference value at N-cap (Richardson-Richardson, 1988)
- RICJ880104
Relative preference value at N1 (Richardson-Richardson, 1988)
- RICJ880105
Relative preference value at N2 (Richardson-Richardson, 1988)
- RICJ880106
Relative preference value at N3 (Richardson-Richardson, 1988)
- RICJ880107
Relative preference value at N4 (Richardson-Richardson, 1988)
- RICJ880108
Relative preference value at N5 (Richardson-Richardson, 1988)
- RICJ880109
Relative preference value at Mid (Richardson-Richardson, 1988)
- RICJ880110
Relative preference value at C5 (Richardson-Richardson, 1988)
- RICJ880111
Relative preference value at C4 (Richardson-Richardson, 1988)
- RICJ880112
Relative preference value at C3 (Richardson-Richardson, 1988)
- RICJ880113
Relative preference value at C2 (Richardson-Richardson, 1988)
- RICJ880114
Relative preference value at C1 (Richardson-Richardson, 1988)
- RICJ880115
Relative preference value at C-cap (Richardson-Richardson, 1988)
- RICJ880116
Relative preference value at C' (Richardson-Richardson, 1988)
- RICJ880117
Relative preference value at C" (Richardson-Richardson, 1988)
- ROBB760101
Information measure for alpha-helix (Robson-Suzuki, 1976)
- ROBB760102
Information measure for N-terminal helix (Robson-Suzuki, 1976)
- ROBB760103
Information measure for middle helix (Robson-Suzuki, 1976)
- ROBB760104
Information measure for C-terminal helix (Robson-Suzuki, 1976)
- ROBB760105
Information measure for extended (Robson-Suzuki, 1976)
- ROBB760106
Information measure for pleated-sheet (Robson-Suzuki, 1976)
- ROBB760107
Information measure for extended without H-bond (Robson-Suzuki, 1976)
- ROBB760108
Information measure for turn (Robson-Suzuki, 1976)
- ROBB760109
Information measure for N-terminal turn (Robson-Suzuki, 1976)
- ROBB760110
Information measure for middle turn (Robson-Suzuki, 1976)
- ROBB760111
Information measure for C-terminal turn (Robson-Suzuki, 1976)
- ROBB760112
Information measure for coil (Robson-Suzuki, 1976)
- ROBB760113
Information measure for loop (Robson-Suzuki, 1976)
- ROBB790101
Hydration free energy (Robson-Osguthorpe, 1979)
- ROSG850101
Mean area buried on transfer (Rose et al., 1985)
- ROSG850102
Mean fractional area loss (Rose et al., 1985)
- ROSM880101
Side chain hydropathy, uncorrected for solvation (Roseman, 1988)
- ROSM880102
Side chain hydropathy, corrected for solvation (Roseman, 1988)
- ROSM880103
Loss of Side chain hydropathy by helix formation (Roseman, 1988)
- SIMZ760101
Transfer free energy (Simon, 1976), Cited by Charton-Charton (1982)
- SNEP660101
Principal component I (Sneath, 1966)
- SNEP660102
Principal component II (Sneath, 1966)
- SNEP660103
Principal component III (Sneath, 1966)
- SNEP660104
Principal component IV (Sneath, 1966)
- SUEM840101
Zimm-Bragg parameter s at 20 C (Sueki et al., 1984)
- SUEM840102
Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
- SWER830101
Optimal matching hydrophobicity (Sweet-Eisenberg, 1983)
- TANS770101
Normalized frequency of alpha-helix (Tanaka-Scheraga, 1977)
- TANS770102
Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
- TANS770103
Normalized frequency of extended structure (Tanaka-Scheraga, 1977)
- TANS770104
Normalized frequency of chain reversal R (Tanaka-Scheraga, 1977)
- TANS770105
Normalized frequency of chain reversal S (Tanaka-Scheraga, 1977)
- TANS770106
Normalized frequency of chain reversal D (Tanaka-Scheraga, 1977)
- TANS770107
Normalized frequency of left-handed helix (Tanaka-Scheraga, 1977)
- TANS770108
Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
- TANS770109
Normalized frequency of coil (Tanaka-Scheraga, 1977)
- TANS770110
Normalized frequency of chain reversal (Tanaka-Scheraga, 1977)
- VASM830101
Relative population of conformational state A (Vasquez et al., 1983)
- VASM830102
Relative population of conformational state C (Vasquez et al., 1983)
- VASM830103
Relative population of conformational state E (Vasquez et al., 1983)
- VELV850101
Electron-ion interaction potential (Veljkovic et al., 1985)
- VENT840101
Bitterness (Venanzi, 1984)
- VHEG790101
Transfer free energy to lipophilic phase (von Heijne-Blomberg, 1979)
- WARP780101
Average interactions per side chain atom (Warme-Morgan, 1978)
- WEBA780101
RF value in high salt chromatography (Weber-Lacey, 1978)
- WERD780101
Propensity to be buried inside (Wertz-Scheraga, 1978)
- WERD780102
Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
- WERD780103
Free energy change of alpha(Ri) to alpha(Rh) (Wertz-Scheraga, 1978)
- WERD780104
Free energy change of epsilon(i) to alpha(Rh) (Wertz-Scheraga, 1978)
- WOEC730101
Polar requirement (Woese, 1973)
- WOLR810101
Hydration potential (Wolfenden et al., 1981)
- WOLS870101
Principal property value z1 (Wold et al., 1987)
- WOLS870102
Principal property value z2 (Wold et al., 1987)
- WOLS870103
Principal property value z3 (Wold et al., 1987)
- YUTK870101
Unfolding Gibbs energy in water, pH7.0 (Yutani et al., 1987)
- YUTK870102
Unfolding Gibbs energy in water, pH9.0 (Yutani et al., 1987)
- YUTK870103
Activation Gibbs energy of unfolding, pH7.0 (Yutani et al., 1987)
- YUTK870104
Activation Gibbs energy of unfolding, pH9.0 (Yutani et al., 1987)
- ZASB820101
Dependence of partition coefficient on ionic strength (Zaslavsky et al., 1982)
- ZIMJ680101
Hydrophobicity (Zimmerman et al., 1968)
- ZIMJ680102
Bulkiness (Zimmerman et al., 1968)
- ZIMJ680103
Polarity (Zimmerman et al., 1968)
- ZIMJ680104
Isoelectric point (Zimmerman et al., 1968)
- ZIMJ680105
RF rank (Zimmerman et al., 1968)
- AURR980101
Normalized positional residue frequency at helix termini N4'(Aurora-Rose, 1998)
- AURR980102
Normalized positional residue frequency at helix termini N"' (Aurora-Rose, 1998)
- AURR980103
Normalized positional residue frequency at helix termini N" (Aurora-Rose, 1998)
- AURR980104
Normalized positional residue frequency at helix termini N'(Aurora-Rose, 1998)
- AURR980105
Normalized positional residue frequency at helix termini Nc (Aurora-Rose, 1998)
- AURR980106
Normalized positional residue frequency at helix termini N1 (Aurora-Rose, 1998)
- AURR980107
Normalized positional residue frequency at helix termini N2 (Aurora-Rose, 1998)
- AURR980108
Normalized positional residue frequency at helix termini N3 (Aurora-Rose, 1998)
- AURR980109
Normalized positional residue frequency at helix termini N4 (Aurora-Rose, 1998)
- AURR980110
Normalized positional residue frequency at helix termini N5 (Aurora-Rose, 1998)
- AURR980111
Normalized positional residue frequency at helix termini C5 (Aurora-Rose, 1998)
- AURR980112
Normalized positional residue frequency at helix termini C4 (Aurora-Rose, 1998)
- AURR980113
Normalized positional residue frequency at helix termini C3 (Aurora-Rose, 1998)
- AURR980114
Normalized positional residue frequency at helix termini C2 (Aurora-Rose, 1998)
- AURR980115
Normalized positional residue frequency at helix termini C1 (Aurora-Rose, 1998)
- AURR980116
Normalized positional residue frequency at helix termini Cc (Aurora-Rose, 1998)
- AURR980117
Normalized positional residue frequency at helix termini C' (Aurora-Rose, 1998)
- AURR980118
Normalized positional residue frequency at helix termini C" (Aurora-Rose, 1998)
- AURR980119
Normalized positional residue frequency at helix termini C"' (Aurora-Rose, 1998)
- AURR980120
Normalized positional residue frequency at helix termini C4' (Aurora-Rose, 1998)
- ONEK900101
Delta G values for the peptides extrapolated to 0 M urea (O'Neil-DeGrado, 1990)
- ONEK900102
Helix formation parameters (delta delta G) (O'Neil-DeGrado, 1990)
- VINM940101
Normalized flexibility parameters (B-values), average (Vihinen et al., 1994)
- VINM940102
Normalized flexibility parameters (B-values) for each residue surrounded by none rigid neighbours (Vihinen et al., 1994)
- VINM940103
Normalized flexibility parameters (B-values) for each residue surrounded by one rigid neighbours (Vihinen et al., 1994)
- VINM940104
Normalized flexibility parameters (B-values) for each residue surrounded by two rigid neighbours (Vihinen et al., 1994)
- MUNV940101
Free energy in alpha-helical conformation (Munoz-Serrano, 1994)
- MUNV940102
Free energy in alpha-helical region (Munoz-Serrano, 1994)
- MUNV940103
Free energy in beta-strand conformation (Munoz-Serrano, 1994)
- MUNV940104
Free energy in beta-strand region (Munoz-Serrano, 1994)
- MUNV940105
Free energy in beta-strand region (Munoz-Serrano, 1994)
- WIMW960101
Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996)
- KIMC930101
Thermodynamic beta sheet propensity (Kim-Berg, 1993)
- MONM990101
Turn propensity scale for transmembrane helices (Monne et al., 1999)
- BLAM930101
Alpha helix propensity of position 44 in T4 lysozyme (Blaber et al., 1993)
- PARS000101
p-Values of mesophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
- PARS000102
p-Values of thermophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
- KUMS000101
Distribution of amino acid residues in the 18 non-redundant families of thermophilic proteins (Kumar et al., 2000)
- KUMS000102
Distribution of amino acid residues in the 18 non-redundant families of mesophilic proteins (Kumar et al., 2000)
- KUMS000103
Distribution of amino acid residues in the alpha-helices in thermophilic proteins (Kumar et al., 2000)
- KUMS000104
Distribution of amino acid residues in the alpha-helices in mesophilic proteins (Kumar et al., 2000)
- TAKK010101
Side-chain contribution to protein stability (kJ/mol) (Takano-Yutani, 2001)
- FODM020101
Propensity of amino acids within pi-helices (Fodje-Al-Karadaghi, 2002)
- NADH010101
Hydropathy scale based on self-information values in the two-state model (5% accessibility) (Naderi-Manesh et al., 2001)
- NADH010102
Hydropathy scale based on self-information values in the two-state model (9% accessibility) (Naderi-Manesh et al., 2001)
- NADH010103
Hydropathy scale based on self-information values in the two-state model (16% accessibility) (Naderi-Manesh et al., 2001)
- NADH010104
Hydropathy scale based on self-information values in the two-state model (20% accessibility) (Naderi-Manesh et al., 2001)
- NADH010105
Hydropathy scale based on self-information values in the two-state model (25% accessibility) (Naderi-Manesh et al., 2001)
- NADH010106
Hydropathy scale based on self-information values in the two-state model (36% accessibility) (Naderi-Manesh et al., 2001)
- NADH010107
Hydropathy scale based on self-information values in the two-state model (50% accessibility) (Naderi-Manesh et al., 2001)
- MONM990201
Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
- KOEP990101
Alpha-helix propensity derived from designed sequences (Koehl-Levitt, 1999)
- KOEP990102
Beta-sheet propensity derived from designed sequences (Koehl-Levitt, 1999)
- CEDJ970101
Composition of amino acids in extracellular proteins (percent) (Cedano et al., 1997)
- CEDJ970102
Composition of amino acids in anchored proteins (percent) (Cedano et al., 1997)
- CEDJ970103
Composition of amino acids in membrane proteins (percent) (Cedano et al., 1997)
- CEDJ970104
Composition of amino acids in intracellular proteins (percent) (Cedano et al., 1997)
- CEDJ970105
Composition of amino acids in nuclear proteins (percent) (Cedano et al., 1997)
- FUKS010101
Surface composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010102
Surface composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010103
Surface composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010104
Surface composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010105
Interior composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010106
Interior composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010107
Interior composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010108
Interior composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010109
Entire chain composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010110
Entire chain composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010111
Entire chain composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
- FUKS010112
Entire chain compositino of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
- AVBF000101
Screening coefficients gamma, local (Avbelj, 2000)
- AVBF000102
Screening coefficients gamma, non-local (Avbelj, 2000)
- AVBF000103
Slopes tripeptide, FDPB VFF neutral (Avbelj, 2000)
- AVBF000104
Slopes tripeptides, LD VFF neutral (Avbelj, 2000)
- AVBF000105
Slopes tripeptide, FDPB VFF noside (Avbelj, 2000)
- AVBF000106
Slopes tripeptide FDPB VFF all (Avbelj, 2000)
- AVBF000107
Slopes tripeptide FDPB PARSE neutral (Avbelj, 2000)
- AVBF000108
Slopes dekapeptide, FDPB VFF neutral (Avbelj, 2000)
- AVBF000109
Slopes proteins, FDPB VFF neutral (Avbelj, 2000)
- YANJ020101
Side-chain conformation by gaussian evolutionary method (Yang et al., 2002)
- MITS020101
Amphiphilicity index (Mitaku et al., 2002)
- TSAJ990101
Volumes including the crystallographic waters using the ProtOr (Tsai et al., 1999)
- TSAJ990102
Volumes not including the crystallographic waters using the ProtOr (Tsai et al., 1999)
- COSI940101
Electron-ion interaction potential values (Cosic, 1994)
- PONP930101
Hydrophobicity scales (Ponnuswamy, 1993)
- WILM950101
Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/MeCN/H2O (Wilce et al. 1995)
- WILM950102
Hydrophobicity coefficient in RP-HPLC, C8 with 0.1%TFA/MeCN/H2O (Wilce et al. 1995)
- WILM950103
Hydrophobicity coefficient in RP-HPLC, C4 with 0.1%TFA/MeCN/H2O (Wilce et al. 1995)
- WILM950104
Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/2-PrOH/MeCN/H2O (Wilce et al. 1995)
- KUHL950101
Hydrophilicity scale (Kuhn et al., 1995)
- GUOD860101
Retention coefficient at pH 2 (Guo et al., 1986)
- JURD980101
Modified Kyte-Doolittle hydrophobicity scale (Juretic et al., 1998)
- BASU050101
Interactivity scale obtained from the contact matrix (Bastolla et al., 2005)
- BASU050102
Interactivity scale obtained by maximizing the mean of correlation coefficient over single-domain globular proteins (Bastolla et al., 2005)
- BASU050103
Interactivity scale obtained by maximizing the mean of correlation coefficient over pairs of sequences sharing the TIM barrel fold (Bastolla et al., 2005)
- SUYM030101
Linker propensity index (Suyama-Ohara, 2003)
- PUNT030101
Knowledge-based membrane-propensity scale from 1D_Helix in MPtopo databases (Punta-Maritan, 2003)
- PUNT030102
Knowledge-based membrane-propensity scale from 3D_Helix in MPtopo databases (Punta-Maritan, 2003)
- GEOR030101
Linker propensity from all dataset (George-Heringa, 2003)
- GEOR030102
Linker propensity from 1-linker dataset (George-Heringa, 2003)
- GEOR030103
Linker propensity from 2-linker dataset (George-Heringa, 2003)
- GEOR030104
Linker propensity from 3-linker dataset (George-Heringa, 2003)
- GEOR030105
Linker propensity from small dataset (linker length is less than six residues) (George-Heringa, 2003)
- GEOR030106
Linker propensity from medium dataset (linker length is between six and 14 residues) (George-Heringa, 2003)
- GEOR030107
Linker propensity from long dataset (linker length is greater than 14 residues) (George-Heringa, 2003)
- GEOR030108
Linker propensity from helical (annotated by DSSP) dataset (George-Heringa, 2003)
- GEOR030109
Linker propensity from non-helical (annotated by DSSP) dataset (George-Heringa, 2003)
- ZHOH040101
The stability scale from the knowledge-based atom-atom potential (Zhou-Zhou, 2004)
- ZHOH040102
The relative stability scale extracted from mutation experiments (Zhou-Zhou, 2004)
- ZHOH040103
Buriability (Zhou-Zhou, 2004)
- BAEK050101
Linker index (Bae et al., 2005)
- HARY940101
Mean volumes of residues buried in protein interiors (Harpaz et al., 1994)
- PONJ960101
Average volumes of residues (Pontius et al., 1996)
- DIGM050101
Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
- WOLR790101
Hydrophobicity index (Wolfenden et al., 1979)
- OLSK800101
Average internal preferences (Olsen, 1980)
- KIDA850101
Hydrophobicity-related index (Kidera et al., 1985)
- GUYH850102
Apparent partition energies calculated from Wertz-Scheraga index (Guy, 1985)
- GUYH850103
Apparent partition energies calculated from Robson-Osguthorpe index (Guy, 1985)
- GUYH850104
Apparent partition energies calculated from Janin index (Guy, 1985)
- GUYH850105
Apparent partition energies calculated from Chothia index (Guy, 1985)
- ROSM880104
Hydropathies of amino acid side chains, neutral form (Roseman, 1988)
- ROSM880105
Hydropathies of amino acid side chains, pi-values in pH 7.0 (Roseman, 1988)
- JACR890101
Weights from the IFH scale (Jacobs-White, 1989)
- COWR900101
Hydrophobicity index, 3.0 pH (Cowan-Whittaker, 1990)
- BLAS910101
Scaled side chain hydrophobicity values (Black-Mould, 1991)
- CASG920101
Hydrophobicity scale from native protein structures (Casari-Sippl, 1992)
- CORJ870101
NNEIG index (Cornette et al., 1987)
- CORJ870102
SWEIG index (Cornette et al., 1987)
- CORJ870103
PRIFT index (Cornette et al., 1987)
- CORJ870104
PRILS index (Cornette et al., 1987)
- CORJ870105
ALTFT index (Cornette et al., 1987)
- CORJ870106
ALTLS index (Cornette et al., 1987)
- CORJ870107
TOTFT index (Cornette et al., 1987)
- CORJ870108
TOTLS index (Cornette et al., 1987)
- MIYS990101
Relative partition energies derived by the Bethe approximation (Miyazawa-Jernigan, 1999)
- MIYS990102
Optimized relative partition energies - method A (Miyazawa-Jernigan, 1999)
- MIYS990103
Optimized relative partition energies - method B (Miyazawa-Jernigan, 1999)
- MIYS990104
Optimized relative partition energies - method C (Miyazawa-Jernigan, 1999)
- MIYS990105
Optimized relative partition energies - method D (Miyazawa-Jernigan, 1999)
- ENGD860101
Hydrophobicity index (Engelman et al., 1986)
- FASG890101
Hydrophobicity index (Fasman, 1989)
- K6.5
Values of Wc in proteins from class Beta, cutoff 6 A, separation 5 (Wozniak, 2014)
- K8.5
Values of Wc in proteins from class Beta, cutoff 8 A, separation 5 (Wozniak, 2014)
- K12.5
Values of Wc in proteins from class Beta, cutoff 12 A, separation 5 (Wozniak, 2014)
- K6.15
Values of Wc in proteins from class Beta, cutoff 6 A, separation 15 (Wozniak, 2014)
- K8.15
Values of Wc in proteins from class Beta, cutoff 8 A, separation 15 (Wozniak, 2014)
- K12.15
Values of Wc in proteins from class Beta, cutoff 12 A, separation 15 (Wozniak, 2014)
Source
AAIndex database.
References
Kawashima, S. and Kanehisa, M. (2000) AAindex: amino acid index database. Nucleic Acids Res., 28:374.
Wozniak, P. and Kotulska M. (2014) Characteristics of protein residue-residue contacts and their application in contact prediction. 20(11):2497
Examples
data(aaprop)
Add 1-grams
Description
Builds (n+1)-grams from n-grams.
Usage
add_1grams(ngram, u, seq_length)
Arguments
ngram |
a single n-gram. |
u |
|
seq_length |
length of an origin sequence. |
Details
n-grams are built by pasting every possible unigram in the every possible free position. The total length of n-gram (n plus total distance between elements of the n-gram) is limited by the length of an origin sequence, because the n-gram cannot be longer than an origin sequence.
Value
vector of n-grams (where n
is equal to the n
of the input plus one).
See Also
Reverse function: gap_ngrams
.
Examples
add_1grams("1_2.3.4_3.0", 1L:4, 8)
add_1grams("a.a_1", c("a", "b", "c"), 4)
Coerce feature_test object to a data frame
Description
Coerce results of test_features
function to a
data.frame
.
Usage
## S3 method for class 'feature_test'
as.data.frame(
x,
row.names = NULL,
optional = FALSE,
stringsAsFactors = FALSE,
...
)
Arguments
x |
object of class |
row.names |
ignored. |
optional |
ignored. |
stringsAsFactors |
logical: should the character vector be converted to a factor?. |
... |
additional arguments to be passed to or from methods. |
Value
a data frame with four columns: names of n-gram, p-values, occurrences in positive and negative sequences.
Binarize
Description
Binarizes a matrix.
Usage
binarize(x)
Arguments
x |
|
Value
a matrix
or simple_triplet_matrix
(depending on the input).
Calculate value of criterion
Description
Computes a chosen statistical criterion for each feature versus target vector.
Usage
calc_criterion(target, features, criterion_function)
Arguments
target |
|
features |
|
criterion_function |
a function calculating criterion. For a full list, see
|
Details
The permutation test implemented in biogram
uses several criterions to filter
important features. Each can be used by test_features
by specifying the
criterion
parameter.
Value
a integer
vector of length equal to the number of features
containing computed information gain values.
Note
Both target
and features
must be binary, i.e. contain only 0
and 1 values.
See Also
Examples
tar <- sample(0L:1, 100, replace = TRUE)
feats <- matrix(sample(0L:1, 400, replace = TRUE), ncol = 4)
# Information Gain
calc_criterion(tar, feats, calc_ig)
# hi-squared-based measure
calc_criterion(tar, feats, calc_cs)
# Kullback-Leibler divergence
calc_criterion(tar, feats, calc_kl)
Calculate Chi-squared-based measure
Description
Computes Chi-squared-based measure between features and target vector.
Usage
calc_cs(feature, target, len_target, pos_target)
Arguments
feature |
feature vector. |
target |
target. |
len_target |
length of the target vector. |
pos_target |
number of positive cases in the target vector. |
Value
A numeric
vector of length 1 representing computed Chi-square values.
Note
Both target
and features
must be binary, i.e. contain only 0
and 1 values.
The function was designed to be as fast as possible subroutine of
calc_criterion
and might be cumbersome if directly called by a user.
See Also
chisq.test
- Pearson's chi-squared test for count data.
Examples
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_cs(feat, tar, 100, sum(tar))
Calculate encoding distance
Description
Computes the encoding distance between two encodings.
Usage
calc_ed(a, b, prop = NULL, measure)
Arguments
a |
encoding (see |
b |
encoding to which |
prop |
|
measure |
See the package vignette for more details. |
Value
an encoding distance.
See Also
calc_si
: compute the similarity index of two encodings.
encoding2df
: converts an encoding to a data frame.
validate_encoding
: validate a structure of an encoding.
Examples
# calculate encoding distance between two encodings of amino acids
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
`2` = c("k", "h"),
`3` = c("d", "e"),
`4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
aa2 = list(`1` = c("g", "a", "p", "v", "m", "l", "q"),
`2` = c("k", "h", "d", "e", "i"),
`3` = c("f", "r", "w", "y", "s", "t", "c", "n"))
calc_ed(aa1, aa2, measure = "pi")
# the encoding distance between two identical encodings is 0
calc_ed(aa1, aa1, measure = "pi")
Calculate IG for single feature
Description
Computes information gain of single feature and target vector.
Usage
calc_ig(feature, target, len_target, pos_target)
Arguments
feature |
feature vector. |
target |
target. |
len_target |
length of the target vector. |
pos_target |
number of positive cases in the target vector. |
Details
The information gain term is used here (improperly) as a synonym of mutual information. It is defined as:
IG(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \left(\frac{p(x, y)}{p(x) p(y)} \right)
In biogram package information gain is computed using following relationship:
IG = E(S) - E(S|F)
Value
A numeric
vector of length 1 representing information gain in nats.
Note
During calculations 0 \log 0 = 0
. For a justification see References.
The function was designed to be afast subroutine of
calc_criterion
and might be cumbersome if directly called by a user.
References
Cover TM, Thomas JA Elements of Information Theory, 2nd Edition Wiley, 2006.
Examples
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_ig(feat, tar, 100, sum(tar))
Calculate KL divergence of features
Description
Computes Kullback-Leibler divergence between features and target vector.
Usage
calc_kl(feature, target, len_target, pos_target)
Arguments
feature |
feature vector. |
target |
target. |
len_target |
length of the target vector. |
pos_target |
number of positive cases in the target vector. |
Value
A numeric
vector of length 1 representing Kullback-Leibler divergence
value.
Note
Both target
and features
must be binary, i.e. contain only 0
and 1 values.
The function was designed to be as fast as possible subroutine of
calc_criterion
and might be cumbersome if directly called by a user.
References
Kullback S, Leibler RA On information and sufficiency. Annals of Mathematical Statistics 22 (1):79-86, 1951.
See Also
test_features
.
Kullback-Leibler divergence is calculated using KL.plugin
.
Examples
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_kl(feat, tar, 100, sum(tar))
Calculate partition index
Description
Computes the encoding distance between two encodings.
Usage
calc_pi(a, b)
Arguments
a |
encoding (see |
b |
encoding to which |
Details
The encoding distance between a
and b
is defined as the
minimum number of amino acids that have to be moved between subgroups of encoding
to make a
identical to b
(order of subgroups in the encoding and amino
acids in a group is unimportant).
If the parameter prop
is supplied, the encoding distance is normalized by the
factor equal to the sum of distances for each group in a
and the closest group
in b
. The position of a group is defined as the mean value of properties of
amino acids or nucleotides belonging the group.
See the package vignette for more details.
Value
an encoding distance.
See Also
calc_si
: compute the similarity index of two encodings.
encoding2df
: converts an encoding to a data frame.
validate_encoding
: validate a structure of an encoding.
Examples
# calculate encoding distance between two encodings of amino acids
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
`2` = c("k", "h"),
`3` = c("d", "e"),
`4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
aa2 = list(`1` = c("g", "a", "p", "v", "m", "l", "q"),
`2` = c("k", "h", "d", "e", "i"),
`3` = c("f", "r", "w", "y", "s", "t", "c", "n"))
calc_pi(aa1, aa2)
# the encoding distance between two identical encodings is 0
calc_pi(aa1, aa1)
Compute similarity index
Description
Computes similarity index between two encodings.
Usage
calc_si(a, b)
Arguments
a |
encoding (see |
b |
encoding to which |
Details
Briefly, the similarity index is a fraction of elements that have the same pairing in both encodings. Pairing is a binary variable, that has value 1 if two elements are in the same group and 0 if not. For more details, see references.
Value
the value of similarity index.
References
Stephenson, J.D., and Freeland, S.J. (2013). Unearthing the Root of Amino Acid Similarity. J Mol Evol 77, 159-169.
See Also
calc_ed
: calculate the encoding distance between two encodings.
Examples
# example from Stephenson & Freeland, 2013 (Fig. 6)
enc1 <- list(`1` = "A",
`2` = c("F", "E"),
`3` = c("C", "D", "G"))
enc2 <- list(`1` = c("A", "G"),
`2` = c("C", "D", "E", "F"))
enc3 <- list(`1` = c("D", "G"),
`2` = c("E", "F"),
`3` = c("A", "C"))
calc_si(enc1, enc2)
calc_si(enc2, enc3)
calc_si(enc1, enc3)
Check chosen criterion
Description
Checks if the criterion is viable or matches it to the list of implemented criterions.
Usage
check_criterion(input_criterion, criterion_names = c("ig", "kl", "cs"))
Arguments
input_criterion |
|
criterion_names |
list of implemented criterions, always in lowercase. |
Value
a list of three:
criterion name,
its function,
nice name for outputs.
See Also
Calculate the value of criterion: calc_criterion
.
Clustering of sequences based on regular expression
Description
Clusters sequences hierarchically with regular expressions. At each step we minimize number of degrees of freedom for all regular expressions needed to describe the data
Usage
cluster_reg_exp(ngrams)
Arguments
ngrams |
list of elements |
Details
Regular expression is a list of the length equal to the length of the input sequences. Each element of the list represents a position in the sequence and contains amino acid, that are likely to occure on this position.
Value
List of four
"regExps"regular expression in best clustering
"seqClustering"clustering of sequences in best clustering
"allRegExps"all regular expressions.
"allIndices"all clusterings
Examples
data(human_cleave)
#cluster_reg_exp is computationally expensive
results <- cluster_reg_exp(human_cleave[1L:10, 1L:4])
Code n-grams
Description
Code human-friendly representation of n-grams into a biogram format.
Usage
code_ngrams(decoded_ngrams)
Arguments
decoded_ngrams |
a |
Value
a character
vector of n-grams.
See Also
Inverse function: decode_ngrams
.
Examples
code_ngrams(c("11_2", "1__12", "222"))
code_ngrams(c("aaa_b", "d__aa", "abd"))
Construct and filter n-grams
Description
Builds and selects important n-grams stepwise.
Usage
construct_ngrams(
target,
seq,
u,
n_max,
conf_level = 0.95,
gap = TRUE,
use_heuristics = TRUE
)
Arguments
target |
|
seq |
a vector or matrix describing sequence(s). |
u |
|
n_max |
size of constructed n-grams. |
conf_level |
confidence level. |
gap |
|
use_heuristics |
if |
Details
construct_ngrams
starts by
extracting unigrams from the sequences, pasting them together in all combination and
choosing from them significant features (with p-value below conf_level
). The
chosen n-grams are further extended to the specified by n_max
size by pasting
unigrams at both ends.
The gap
parameter determines if construct_ngrams
performs the
feature selection on exact n-grams (gap
equal to FALSE) or on all features in the
Hamming distance 1 from the n-gram (gap
equal to TRUE).
Value
a vector of n-grams.
See Also
Feature filtering method: test_features
.
Examples
# to make the example faster, we run construct_ngrams() on the
# subset of data
deg_seqs <- degenerate(human_cleave[c(1L:100, 801L:900), 1L:9],
list(`1` = c(1, 6, 8, 10, 11, 18),
`2` = c(2, 13, 14, 16, 17),
`3` = c(5, 19, 20),
`4` = c(7, 9, 12, 15),
'5' = c(3, 4)))
bigrams <- construct_ngrams(human_cleave[c(1L:100, 801L:900), "tar"], deg_seqs, 1L:5, 2)
Detect and count multiple n-grams in sequences
Description
A convinient wrapper around count_ngrams
for counting multiple
values of n
and d
.
Usage
count_multigrams(
ns,
ds = rep(0, length(ns)),
seq,
u,
pos = FALSE,
scale = FALSE,
threshold = 0
)
Arguments
ns |
|
ds |
|
seq |
a vector or matrix describing sequence(s). |
u |
|
pos |
|
scale |
|
threshold |
|
Details
ns
vector and ds
vector must have equal length. Elements of
ds
vector are used as equivalents of d
parameter for respective values
of ns
. For example, if ns
is c(4, 4, 4)
, the ds
must be a list of
length 3. Each element of the ds
list must have length 3 or 1, as appropriate
for a d
parameter in count_ngrams
function.
Value
An integer
matrix with named columns. The naming conventions are the same
as in count_ngrams
.
Examples
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
count_multigrams(c(3, 1), list(c(1, 0), 0), seqs, 1L:4, pos = TRUE)
# if ds parameter is not present, n-grams are calculated for distance 0
count_multigrams(c(3, 1), seq = seqs, u = 1L:4)
# calculate three times n-gram with the same length, but different distances between
# elements
count_multigrams(c(4, 4, 4), list(c(2, 0, 1), c(2, 1, 0), c(0, 1, 2)),
seqs, 1L:4, pos = TRUE)
Count n-grams in sequences
Description
Counts all n-grams or position-specific n-grams present in the input sequence(s).
Usage
count_ngrams(seq, n, u, d = 0, pos = FALSE, scale = FALSE, threshold = 0)
Arguments
seq |
a vector or matrix describing sequence(s). |
n |
|
u |
|
d |
|
pos |
|
scale |
|
threshold |
|
Details
A distance
vector should be always n
- 1 in length.
For example when n
= 3, d
= c(1,2) means A_A__A. For n
= 4,
d
= c(2,0,1) means A__AA_A. If vector d
has length 1, it is recycled to
length n
- 1.
n-gram names follow a specific convention and have three parts for position-specific
n-grams and two parts otherwise. The parts are separated by _
. The .
symbol
is used to separate elements within a part. The general naming scheme is
POSITION_NGRAM_DISTANCE
. The optional POSITION
part of the name indicates
the actual position of the n-gram in the sequence(s) and will be present
only if pos
= TRUE
. This part is always a single integer. The NGRAM
part of the name is a sequence of elements in the n-gram. For example, 4.2.2
indicates the n-gram 422 (e.g. TCC). The DISTANCE
part of the name is a vector of
distance(s). For example, 0.0
indicates zero distances (continuous n-grams), while
1.2
represents distances for the n-gram A_A__A.
Examples of n-gram names:
46_4.4.4_0.1 : trigram 44_4 on position 46
12_2.1_2 : bigram 2__1 on position 12
8_1.1.1_0.0 : continuous trigram 111 on position 8
1.1.1_0.0 : continuous trigram 111 without position information
Value
a simple_triplet_matrix
where columns represent
n-grams and rows sequences. See Details
for specifics of the naming convention.
Note
By default, the counted n-gram data is stored in a memory-saving format.
To convert an object to a 'classical' matrix use the as.matrix
function. See examples for further information.
See Also
Create vector of possible n-grams: create_ngrams
.
Extract n-grams from sequence(s): seq2ngrams
.
Get indices of n-grams: get_ngrams_ind
.
Count n-grams for multiple values of n: count_multigrams
.
Count only specified n-grams: count_specified
.
Examples
# count trigrams without position information for nucleotides
count_ngrams(sample(1L:4, 50, replace = TRUE), 3, 1L:4, pos = FALSE)
# count position-specific trigrams from multiple nucleotide sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
ngrams <- count_ngrams(seqs, 3, 1L:4, pos = TRUE)
# output results of the n-gram counting to screen
as.matrix(ngrams)
Count specified n-grams
Description
Counts specified n-grams in the input sequence(s).
Usage
count_specified(seq, ngrams)
Arguments
seq |
vector or matrix describing sequence(s). |
ngrams |
vector of n-grams. |
Details
count_specified
counts only selected n-grams declared by
user in the ngrams
parameter. Declared n-grams must be written using the
biogram
notation.
Value
A simple_triplet_matrix
where columns represent
n-grams and rows sequences.
See Also
Count all possible n-grams: count_ngrams
.
Examples
seqs <- matrix(c(1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 3, 4, 1, 2, 2, 4), nrow = 2)
count_specified(seqs, ngrams = c("1.1.1_0.0", "2.2.2_0.0", "1.1.2_0.0"))
seqs <- matrix(sample(1L:5, 200, replace = TRUE), nrow = 20)
count_specified(seqs, ngrams = c("2_4.2_0", "2_1.4_0", "3_1.3_0",
"2_4.2_1", "2_1.4_1", "3_1.3_1",
"2_4.2_2", "2_1.4_2", "3_1.3_2"))
Count total number of n-grams
Description
Computes total number of n-grams that can be extracted from sequences.
Usage
count_total(seq, n, d)
Arguments
seq |
a vector or matrix describing sequence(s). |
n |
|
d |
|
Details
The maximum number of possible n-grams is limited by their length and the distance between elements of the n-gram.
Value
An integer
rperesenting the total number of n-grams.
Note
A format of d
vector is discussed in Details of
count_ngrams
. The maximum
Examples
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
# make several sequences shorter by replacing them partially with NA
seqs[8L:11, 46L:50] <- NA
seqs[1L, 31L:50] <- NA
count_total(seqs, 3, c(1, 0))
Create encoding
Description
Reduces an alphabet using physicochemical properties.
Usage
create_encoding(prop, len)
Arguments
prop |
|
len |
length of the resulting encoding. Must be larger than zero and smaller than number of elements in the alphabet. |
Details
The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.
Value
An encoding.
See Also
calc_ed
: calculate the encoding distance between two encodings.
encoding2df
: converts an encoding to a data frame.
validate_encoding
: validate a structure of an encoding.
Examples
enc1 = list(`1` = c("a", "t"),
`2` = c("g", "c"))
encoding2df(enc1)
Create feature according to given contingency matrix
Description
Creates a matrix of features and target based on the values from contingency matrix.
Usage
create_feature_target(n11, n01, n10, n00)
Arguments
n11 |
number of elements for which both target and feature equal 1. |
n01 |
number of elements for which target and feature equal 1,0 respectively. |
n10 |
number of elements for which target and feature equal 0,1 respectively. |
n00 |
number of elements for which both target and feature equal 0. |
Value
a matrix of 2 columns and n11+n10+n01+n00 rows. Columns represent target and feature vectors, respectively.
Examples
# equivalent of
# target
# feature 10 375
# 15 600
target_feature <- create_feature_target(10, 375, 15, 600)
Get all possible n-Grams
Description
Creates the vector of all possible n_grams (for given n
).
Usage
create_ngrams(n, u, possible_grams = NULL)
Arguments
n |
|
u |
|
possible_grams |
number of possible n-grams. If not |
Details
See Details section of count_ngrams
for more
information about n-grams naming convention. The possible information about distance
must be added by hand (see examples).
Value
a character vector. Elements of n-gram are separated by dot.
Note
Input data must be a matrix or data frame of numeric elements.
Examples
# bigrams for standard aminoacids
create_ngrams(2, 1L:20)
# bigrams for standard aminoacids with positions, 10 amino acid long sequence, so
# only 9 bigrams can be located in sequence
create_ngrams(2, 1L:20, 9)
# bigrams for DNA with positions, 10 nucleotide long sequence, distance 1, so only
# 8 bigrams in sequence
# paste0 adds information about distance at the end of n-gram
paste0(create_ngrams(2, 1L:4, 8), "_0")
criterion_distribution class
Description
A result of distr_crit
function.
Details
An object of class criterion_distribution
is a numeric matrix.
Data
- 1st column:
possible values of criterion.
- 2nd column:
probability density function.
- 3rd column:
cumulative distribution function.
Attributes
- plot_data
A matrix with values of the criterion and their probabilities.
- nice_name
'Nice' name of the criterion.
Categorize tested features
Description
Categorizes results of test_features
function into groups based on their
significance.
Usage
## S3 method for class 'feature_test'
cut(x, split = "significances", breaks = c(0, 1e-04, 0.01, 0.05, 1), ...)
Arguments
x |
an object of class |
split |
attribute along which output should be categorized. Possible values are
|
breaks |
a vector of significances of frequencies along which n-grams are aggregated.
See description of |
... |
further parameters accepted by the |
Value
the value of function depends on the split
parameter.
The function returns a named list of length equal to the length
of significances
(when split
equals "significances"
) or
frequencies
(when split
equals "positives"
or "negatives"
)
minus one. Each elements of the list contains names of the n-grams belonging to the given
significance or frequency group.
Decode n-grams
Description
Transforms a vector of n-grams into a human-friendly form.
Usage
decode_ngrams(ngrams)
Arguments
ngrams |
a |
Value
a character
vector of length equal to the number of n-grams.
Note
Decoded n-grams lose the position information.
See Also
Validate n-gram structure: is_ngram
.
Inverse function: code_ngrams
.
Examples
decode_ngrams(c("2_1.1.2_0.1", "3_1.1.2_2.0", "3_2.2.2_0.0"))
Degenerate protein sequence
Description
'Degenerates' amino acid or nucleic sequence by aggregating elements to bigger groups.
Usage
degenerate(seq, element_groups)
Arguments
seq |
|
element_groups |
encoding of elements: list of groups to which elements of sequence should be aggregated. Must have unique names. |
Value
A character
vector or matrix (if input is a matrix)
containing aggregated elements.
Note
Characters not present in the element_groups
will be converted to NA with a
warning.
See Also
l2n
to easily convert information stored in biological sequences from
letters to numbers.
calc_ed
to calculate distance between encodings.
Examples
sample_seq <- c(1, 3, 1, 3, 4, 4, 3, 1, 2)
table(sample_seq)
# aggregate sequence to purins and pyrimidines
deg_seq <- degenerate(sample_seq, list(w = c(1, 4), s = c(2, 3)))
table(deg_seq)
Degenerate n-grams
Description
'Degenerates' n-grams by aggregating amino acid or nucleotide elements into bigger groups.
Usage
degenerate_ngrams(x, element_groups, binarize = FALSE)
Arguments
x |
object containing n-grams. |
element_groups |
encoding of elements: list of groups to which elements of n-grams should be aggregated. Must have unique names. |
binarize |
logical indicating if n-grams should be binarized |
Value
A character
vector or matrix (if input is a matrix)
containing degenerated n-grams.
Compute criterion distribution
Description
Computes criterion distribution under null hypothesis for all contingency tables possible for a feature and a target.
Usage
distr_crit(target, feature, criterion = "ig", iter_limit = 200)
Arguments
target |
{0,1}-valued target vector. See Details. |
feature |
{0,1}-valued feature vector. See Details. |
criterion |
criterion used for calculations of distribution.
See |
iter_limit |
limit the number of calculated contingence matrices. If
|
Details
both target
and feature
vectors may contain only 0
and 1.
Value
An object of class criterion_distribution
.
See Also
Examples
target_feature <- create_feature_target(10, 375, 15, 600)
distr_crit(target = target_feature[,1], feature = target_feature[,2])
Convert encoding to data frame
Description
Converts an encoding to a data frame.
Usage
encoding2df(x, sort = FALSE)
Arguments
x |
encoding. |
sort |
if |
Details
The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.
Value
data frame with two columns. First column represents an index of a group in the supplied encoding and the second column contains all elements of the encoding.
See Also
calc_ed
: calculate the encoding distance between two encodings.
encoding2df
: converts an encoding to a data frame.
validate_encoding
: validate a structure of an encoding.
Examples
create_encoding(aaprop[1L:5, ], 5)
2d cross-tabulation
Description
Quickly cross-tabulates two binary vectors.
Usage
fast_crosstable(target, len_target, pos_target, feature)
Arguments
target |
target. |
len_target |
length of the target vector. |
pos_target |
number of positive cases in the target vector. |
feature |
feature vector. |
Details
Input looks odd, but the function was build to be fast
subroutine of calc_ig
, which works on
many features but only one target.
Value
a vector of length four:
target +, feature+
target +, feature-
target -, feature+
target -, feature-
Note
Binary vector means a numeric vector with 0 or 1.
Examples
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
fast_crosstable(tar, length(tar), sum(tar), feat)
feature_test class
Description
A result of test_features
function.
Details
An object of the feature_test
class is a numeric vector of p-values.
Additional attributes characterizes futher the details of test which returned these
p-values.
Attributes
- criterion
the criterion used in permutation test.
- adjust
the name of p-value adjusting method.
- times
the number of permutations. If QuiPT was chosen
NA
.- occ
frequency of features splitted in subset based on the value of target.
See Also
Methods:
Convert encoding from full to simple format
Description
Converts an encoding from the full format to the simple format.
Usage
full2simple(x)
Arguments
x |
encoding. |
Examples
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
`2` = c("k", "h"),
`3` = c("d", "e"),
`4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
full2simple(aa1)
Gap n-grams
Description
Introduces gaps in the n-grams.
Usage
gap_ngrams(ngrams)
Arguments
ngrams |
a vector of positioned n-grams (as created by |
Details
A single element of the input n-gram at a time will be replaced
by a gap. For example, introducing gaps in n-gram 2_1.1.2_0.1
will results in three n-grams: 3_1.2_1
(where the 2_1_0
unigram
was replaced by a gap), 2_1.2_2
and 2_1.1_0
.
Value
A character
vector of (n-1)-grams with introduced gaps.
See Also
Reverse function: add_1grams
.
Examples
gap_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"))
gap_ngrams(c("1.1.2_0.1", "1.1.2_0.0", "2.2.2_0.0"))
Generate sequence
Description
Generate a sequences using an alphabet of unigrams and set of rules.
Usage
generate_sequence(alphabet, regions)
Arguments
alphabet |
the unigram alphabet. Columns are equivalent to unigrams and rows to particular properties. |
regions |
a list of rules describing regions. |
Generate single region
Description
Generate a region using an alphabet of unigrams and considering provided set of rules.
Usage
generate_single_region(alphabet, reg_len, prop_ranges, exactness)
Arguments
alphabet |
the unigram alphabet. Columns are equivalent to unigrams and rows to particular properties. |
reg_len |
the number of unigrams inside the region. |
prop_ranges |
required intervals of properties of unigrams in the region. See Details. |
exactness |
a |
Examples
props1 <- list(P1 = c(0, 0.5),
P2 = c(0.2, 0.4),
P3 = c(0.5, 1),
P4 = c(0, 0))
props2 <- list(P1 = c(0.5, 1),
P2 = c(0.4, 1),
P3 = c(0, 0.5),
P4 = c(1, 1))
alph <- generate_unigrams(c(replicate(8, props1, simplify = FALSE),
replicate(12, props2, simplify = FALSE)),
unigram_names = letters[1L:20])
rules1 <- list(P1 = c(0.5, 1),
P2 = c(0.4, 1),
P3 = c(0, 0.5),
P4 = c(1, 1))
generate_single_region(alph, 10, rules1, 0.9)
Generate single unigram
Description
Assign randomly generated properties to a single unigram.
Usage
generate_single_unigram(unigram_ranges)
Arguments
unigram_ranges |
list of ranges containing respective properties. If named, names are preserved. |
See Also
generate_single_unigram
is a helper function for
generate_unigrams
.
Examples
generate_single_unigram(list(P1 = c(0, 0.5),
P2 = c(0.2, 0.4),
P3 = c(0.5, 1),
P4 = c(0, 0)))
Generate unigrams
Description
Generates an alphabet of unigrams based on given list of properties.
Usage
generate_unigrams(unigram_list, unigram_names = NULL, prop_names = NULL)
Arguments
unigram_list |
a list of unigrams' parameters. See Details. |
unigram_names |
names of unigrams. If not |
prop_names |
names of properties. If not |
Details
Unigram parameters are represented as a list of intervals, where each interval corresponds to a different property. The function generate unigrams randomly choosing values of properties from given intervals using uniform distribution. All lists of ranges should have the same length, which equils to describing each unigram using the same properties.
Examples
props1 <- list(P1 = c(0, 0.5),
P2 = c(0.2, 0.4),
P3 = c(0.5, 1),
P4 = c(0, 0))
props2 <- list(P1 = c(0.5, 1),
P2 = c(0.4, 1),
P3 = c(0, 0.5),
P4 = c(1, 1))
alph <- generate_unigrams(c(replicate(8, props1, simplify = FALSE),
replicate(12, props2, simplify = FALSE)),
unigram_names = letters[1L:20])
Get indices of n-grams
Description
Computes list of n-gram elements positions in sequence.
Usage
get_ngrams_ind(len_seq, n, d)
Arguments
len_seq |
|
n |
|
d |
|
Details
A format of d
vector is discussed in Details of
count_ngrams
.
Value
A list with number of elements equal to n
. Every element is a
vector containing locations of given n-gram letter. For example, first element of
list contain indices of first letter of all n-grams. The attribute d
of output contains distances between letter used to compute locations
(see Details).
Examples
# positions trigrams in sequence of length 10
get_ngrams_ind(10, 9, 0)
Human signal peptides cleavage sites
Description
A set of 648 cleavage sites and 648 parts of mature proteins shortly after cleavage sites derived from human proteome.
Format
A data frame with 1296 observations on the following 10 variables. Columns from
P1
to P9
describes positions in an extracted peptide. tar
is a target vector. It
has value 1 if a peptide is a cleavage site and 0 if not.
Details
Each peptide in the data set is nine amino acid residues long. In case of cleavage sites, the clevage is located between fifth and sixth peptide. The non-cleavage sites are parts of mature proteins starting five positions after cleavage site.
Note
Amino acid residues were recoded as integers.
Source
Examples
data(human_cleave)
table(human_cleave[, 1])
Validate n-gram
Description
Checks if the character string may be used as an n-gram and its notation follows specific
convention of biogram
package.
Usage
is_ngram(x)
Arguments
x |
|
Value
TRUE
if n-gram's notation is correct, FALSE
if not.
Examples
print(is_ngram("1_1.1.1_0.0"))
print(is_ngram("not_ngram"))
Convert letters to numbers
Description
Converts biological sequence from letter to number notation.
Usage
l2n(seq, seq_type)
Arguments
seq |
|
seq_type |
the type of sequence. Can be |
Value
a numeric
vector or matrix containing converted elements.
See Also
l2n
is a wrapper around degenerate
.
Inverse function: n2l
.
Examples
sample_seq <- c("a", "d", "d", "g", "a", "g", "n", "a", "l")
l2n(sample_seq, "prot")
Convert list of sequences to matrix
Description
Converts list of sequences to matrix.
Usage
list2matrix(seq_list)
Arguments
seq_list |
list of sequences (e.g. as returned by
the |
Value
A matrix with the number of rows equal to the number of sequences and the number of columns equal to the length of the longest sequence.
Note
Since matrix must have specified number of columns, ends of shorter sequences are completed with NAs.
Examples
list2matrix(list(s1 = c("c", "g", "g", "t"),
s2 = c("g", "t", "c", "t", "t", "g"),
s3 = c("a", "a", "t")))
Convert numbers to letters
Description
Converts biological sequence from number to letter notation.
Usage
n2l(seq, seq_type)
Arguments
seq |
|
seq_type |
the type of sequence. Can be |
Value
a character
vector or matrix containing converted elements.
See Also
n2l
is a wrapper around degenerate
.
Inverse function: l2n
.
Examples
sample_seq <- c(1, 3, 3, 6, 1, 6, 12, 1, 10)
n2l(sample_seq, "prot")
n-grams to data frame
Description
Tranforms a vector of n-grams into a data frame.
Usage
ngrams2df(ngrams)
Arguments
ngrams |
a |
Value
a data.frame
with 2 (in case of n-grams without known position) or
three columns (n-grams with position information).
See Also
Decode n-grams: decode_ngrams
.
Examples
ngrams2df(c("2_1.1.2_0.0", "3_1.1.2_0.0", "3_2.2.2_0.0", "2_1.1_0"))
Plot criterion distribution
Description
Plots results of distr_crit
function.
Usage
## S3 method for class 'criterion_distribution'
plot(x, ...)
Arguments
x |
object of class |
... |
further arguments passed to |
Value
nothing.
Examples
target_feature <- create_feature_target(10, 375, 15, 600)
example_result <- distr_crit(target = target_feature[,1],
feature = target_feature[,2])
plot(example_result)
# a ggplot2 plot
library(ggplot2)
ggplot_distr <- function(x) {
b <- data.frame(cbind(x=as.numeric(rownames(attr(x, "plot_data"))),
attr(x, "plot_data")))
d1 <- cbind(b[,c(1,2)], attr(x, "nice_name"))
d2 <- cbind(b[,c(1,3)], "Probability")
colnames(d1) <- c("x", "y", "panel")
colnames(d2) <- c("x", "y", "panel")
d <- rbind(d1, d2)
p <- ggplot(data = d, mapping = aes(x = x, y = y)) +
facet_grid(panel~., scale="free") +
geom_freqpoly(data= d2, aes(color=y), stat = "identity") +
scale_fill_brewer(palette = "Set1") +
geom_point(data=d1, aes(size=y), stat = "identity") +
guides(color = "none") +
guides(size = "none") +
xlab("Number of cases with feature=1 and target=1") + ylab("")
p
}
ggplot_distr(example_result)
Position n-grams
Description
Tranforms a vector of positioned n-grams into a list of positions filled with n-grams that start on them.
Usage
position_ngrams(ngrams, df = FALSE, unigrams_output = TRUE)
Arguments
ngrams |
a vector of positioned n-grams (as created by |
df |
logical, if |
unigrams_output |
logical, if |
Value
if df
is FALSE
, returns a list of length equal to the number of unique
n-gram starts present in n-grams. Each element of the list contains n-grams that start on
this position. If df
is FALSE
, returns a data frame where first column contains
n-grams and the second column represent their start positions.
See Also
Transform n-gram name to human-friendly form: decode_ngrams
.
Validate n-gram structure: is_ngram
.
Examples
# position data in the list format
position_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"))
# position data in the data frame format
position_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"), df = TRUE)
Print tested features
Description
Prints results of test_features
function.
Usage
## S3 method for class 'feature_test'
print(x, ...)
Arguments
x |
object of class |
... |
further arguments passed to |
Value
nothing.
Read FASTA files
Description
A lightweight tool to read nucleic or amino-acid sequences from a file in FASTA format.
Usage
read_fasta(file)
Arguments
file |
the name of the file which the data are to be read from. |
Value
a list of sequences.
See Also
read.fasta
: heavier function for processing FASTA files.
Examples
## Not run:
read_fasta("https://www.uniprot.org/uniprot/P28307.fasta")
## End(Not run)
Regenerate n-grams
Description
'Regenerates' amino acid or nucleic sequence written in a simplified alphabet by converting groups to regular expression.
Usage
regenerate(x, element_groups)
Arguments
x |
|
element_groups |
encoding of elements: list of groups to which elements of sequence should be aggregated. Must have unique names. |
Value
A character
string representing a POSIX regular expression.
Note
Gaps (_
) will be converted to any possible character from the alphabet
(nucleotides or amino acids).
See Also
degenerate
to easily convert information stored in biological sequences from
letters to numbers.
calc_ed
to calculate distance between simplified alphabets.
Examples
regenerate("ssw", list(w = c(1, 4), s = c(2, 3)))
regional_param class
Description
List of rules defining the region.
Details
An object of the regional_param
class is a list consisting of all rules
necessary to properly build a region.
Attributes
- reg_len
the number of unigrams inside the region. Might be 0
- prop_ranges
required intervals of properties of unigrams in the region
- exactness
a
numeric
value between 0 and 1 defining how stricly unigrams are kept withinprop_ranges
. If 1, only unigrams withinprop_ranges
are inside the region. if 0.9, there is 10 unigrams that are not in theprop_ranges
will be inside the region.
See Also
Extract n-grams from sequence
Description
Extracts vector of n-grams present in sequence(s).
Usage
seq2ngrams(seq, n, u, d = 0, pos = FALSE)
Arguments
seq |
a vector or matrix describing sequence(s). |
n |
|
u |
|
d |
|
pos |
|
Details
A format of d
vector is discussed in Details of
count_ngrams
.
Value
A character
matrix of n-grams, where every row corresponds to a
different sequence.
Examples
# trigrams from multiple sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
seq2ngrams(seqs, 3, 1L:4)
Convert encoding from simple to full format
Description
Converts an encoding from the simple format to the full format.
Usage
simple2full(x)
Arguments
x |
encoding (see Details). |
Details
The encoding should be named. Each name should correspond to a different amino acid or nucleotide.
Examples
aa1 = structure(c("1", "4", "3", "3", "4", "1", "2", "1", "2", "1",
"1", "4", "1", "4", "4", "4", "4", "1", "4", "4"),
.Names = c("a", "c", "d", "e", "f", "g", "h", "i",
"k", "l", "m", "n", "p", "q",
"r", "s", "t", "v", "w", "y"))
simple2full(aa1)
Summarize tested features
Description
Summarizes results of test_features
function.
Usage
## S3 method for class 'feature_test'
summary(object, conf_level = 0.95, ...)
Arguments
object |
of class |
conf_level |
confidence level. A feature with p-value equal to or smaller than the confidence is considered significant. |
... |
ignored |
Value
nothing.
Tabulate n-grams
Description
Builds a contingency table of the n-gram counts versus their class labels.
Usage
table_ngrams(seq, ngrams, target)
Arguments
seq |
vector or matrix describing sequence(s). |
ngrams |
vector of n-grams. |
target |
|
Value
a data frame with the number of columns equal to the length of the
target
plus 1. The first column contains names of the n-grams. Further
columns represents counts of n-grams for respective value of the
target
.
Examples
seqs_pos <- matrix(sample(c("a", "c", "g", "t"), 100, replace = TRUE,
prob = c(0.2, 0.4, 0.35, 0.05)), ncol = 5)
seqs_neg <- matrix(sample(c("a", "c", "g", "t"), 100, replace = TRUE),
ncol = 5)
tab <- table_ngrams(seq = rbind(seqs_pos, seqs_neg),
ngrams = c("1_c.t_0", "1_g.g_0", "2_t.c_0", "2_g.g_0", "3_c.c_0", "3_g.c_0"),
target = c(rep(1, 20), rep(0, 20)))
# see the results
print(tab)
# easily plot the results using ggplot2
Permutation test for feature selection
Description
Performs a feature selection on positioned n-gram data using a Fisher's permutation test.
Usage
test_features(
target,
features,
criterion = "ig",
adjust = "BH",
threshold = 1,
quick = TRUE,
times = 1e+05
)
Arguments
target |
|
features |
|
criterion |
criterion used in permutation test. See Details for the list of possible criterions. |
adjust |
name of p-value adjustment method. See |
threshold |
|
quick |
|
times |
number of times procedure should be repeated. Ignored if |
Details
Since the procedure involves multiple testing, it is advisable to use one
of the avaible p-value adjustment methods. Such methods can be used directly by
specifying the adjust
parameter.
Available criterions:
- ig
Information Gain:
calc_ig
.- kl
Kullback-Leibler divergence:
calc_kl
.- cs
Chi-squared-based measure:
calc_cs
.
Value
an object of class feature_test
.
Note
Both target
and features
must be binary, i.e. contain only 0
and 1 values.
Features occuring too often and too rarely are considered not informative and may be removed using the threshold parameter.
References
Radivojac P, Obradovic Z, Dunker AK, Vucetic S, Feature selection filters based on the permutation test in Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Springer, 2004.
See Also
binarize
- binarizes input data.
calc_criterion
- computes selected criterion.
distr_crit
- distribution of criterion used in QuiPT.
summary.feature_test
- summary of results.
cut.feature_test
- aggregates test results in groups based on feature's
p-value.
Examples
# significant feature
tar_feat1 <- create_feature_target(10, 390, 0, 600)
# significant feature
tar_feat2 <- create_feature_target(9, 391, 1, 599)
# insignificant feature
tar_feat3 <- create_feature_target(198, 202, 300, 300)
test_res <- test_features(tar_feat1[, 1], cbind(tar_feat1[, 2], tar_feat2[, 2],
tar_feat3[, 2]))
summary(test_res)
cut(test_res)
# real data example
# we will analyze only a subsample of a dataset to make analysis quicker
ids <- c(1L:100, 701L:800)
deg_seqs <- degenerate(human_cleave[ids, 1L:9],
list(`a` = c(1, 6, 8, 10, 11, 18),
`b` = c(2, 5, 13, 14, 16, 17, 19, 20),
`c` = c(3, 4, 7, 9, 12, 15)))
# positioned n-grams example
bigrams_pos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], bigrams_pos)
# unpositioned n-grams example, binarization required
bigrams_notpos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], binarize(bigrams_notpos))
Validate encoding
Description
Checks the structure of an encoding.
Usage
validate_encoding(x, u)
Arguments
x |
encoding. |
u |
|
Details
The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.
Value
TRUE
if the x
is a correctly reduced u
,
FALSE
in any other cases.
See Also
calc_ed
: calculate the encoding distance between two encodings.
encoding2df
: converts an encoding to a data frame.
Examples
enc1 = list(`1` = c("a", "t"),
`2` = c("g", "c"))
# see if enc1 is the correctly reduced nucleotide (DNA) alphabet
validate_encoding(enc1, c("a", "c", "g", "t"))
# enc1 is not the RNA alphabet, so the results is FALSE
validate_encoding(enc1, c("a", "c", "g", "u"))
# validate_encoding works also on other notations
enc2 = list(a = c(1, 4),
b = c(2, 3))
validate_encoding(enc2, 1L:4)
Write encodings to a file
Description
Saves a list of encodings (or a single encoding to the file).
Usage
write_encoding(x, file = "")
Arguments
x |
encoding or list of encodings. |
file |
ither a character string naming a file or a
|
Examples
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
`2` = c("k", "h"),
`3` = c("d", "e"),
`4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
write_encoding(aa1)
Write FASTA files
Description
A lightweight tool to read nucleic or amino-acid sequences from a file in FASTA format.
Usage
write_fasta(seq, file, nchar = 80)
Arguments
seq |
a list of sequences. |
file |
the name of the output file. |
nchar |
the number of characters per line. |
See Also
write.fasta
: heavier function for writing FASTA files.