Title: Detect Elevations and Gaps in Mapped Sequencing Read Coverage
Version: 0.1.0
Maintainer: Jessie Maier <jlmaier@ncsu.edu>
Description: Automate the detection of gaps and elevations in mapped sequencing read coverage using a 2D pattern-matching algorithm. 'ProActive' detects, characterizes and visualizes read coverage patterns in both genomes and metagenomes. Optionally, users may provide gene annotations associated with their genome or metagenome in the form of a .gff file. In this case, 'ProActive' will generate an additional output table containing the gene annotations found within the detected regions of gapped and elevated read coverage. Additionally, users can search for gene annotations of interest in the output read coverage plots.
License: GPL-2
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
URL: https://github.com/jlmaier12/ProActive, https://jlmaier12.github.io/ProActive/
BugReports: https://github.com/jlmaier12/ProActive/issues
Imports: utils, stats, dplyr, ggplot2, stringr
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0), kableExtra
VignetteBuilder: knitr
Depends: R (≥ 4.2.0)
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2025-01-20 20:39:54 UTC; jlmaier
Author: Jessie Maier [aut, cre, cph], Manuel Kleiner [aut, ths]
Repository: CRAN
Date/Publication: 2025-01-21 08:00:02 UTC

ProActive

Description

'ProActive' automatically detects regions of gapped and elevated read coverage using a 2D pattern-matching algorithm. 'ProActive' detects, characterizes and visualizes read coverage patterns in both genomes and metagenomes. Optionally, users may provide gene annotations associated with their genome or metagenome in the form of a .gff file. In this case, 'ProActive' will generate an additional output table containing the gene annotations found within the detected regions of gapped and elevated read coverage. Additionally, users can search for gene annotations of interest in the output read coverage plots.

Details

The three main functions in 'ProActive' are:

  1. ProActiveDetect performs the pattern-matching and characterization of read coverage patterns.

  2. plotProActiveResults plots the results from ProActiveDetect().

  3. geneAnnotationSearch searches classified contigs/chunks for gene annotations that match user-provided keywords.

Author(s)

Jessie Maier jlmaier@ncsu.edu

See Also

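Useful links:

  https://github.com/jlmaier12/ProActive

  https://jlmaier12.github.io/ProActive/

  Report bugs at https://github.com/jlmaier12/ProActive/issues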


Detect gene predictions in elevations and gaps

Description

Extracts subsets of the gffTSV associated with gene predictions that fall within regions of detected gapped or elevated read coverage.

Usage

GPsInElevGaps(
  elevGapSummList,
  windowSize,
  gffTSV,
  mode,
  chunkContigs,
  chunkSize
)

Arguments

elevGapSummList

A list containing pattern-match information associated with all elevation and gap classifications (i.e. excluding 'NoPattern' classifications).

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

gffTSV

Optional, a .gff file (TSV) containing gene predictions associated with the .fasta file used to generate the pileup.

mode

Either "genome" or "metagenome"

chunkContigs

TRUE or FALSE. If TRUE and 'mode'="metagenome", contigs longer than the 'chunkSize' will be 'chunked' into smaller subsets and pattern-matching will be performed on each subset. Default is FALSE.

chunkSize

If 'mode'="genome" OR if 'mode'="metagenome" and 'chunkContigs'=TRUE, chunk the genome or contigs, respectively, into smaller subsets for pattern-matching.
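
The idea of extracting gene predictions that fall within a detected region can be illustrated with a minimal base-R sketch. The table, its column names and the region list below are toy assumptions, not the structure of 'elevGapSummList' or the package's implementation, and "within" is assumed here to mean fully contained in the region.

```r
# Toy gene predictions; the column names are illustrative, not the package's.
toyGenes <- data.frame(
  seqid = c("contigA", "contigA", "contigA", "contigB"),
  start = c(500, 12000, 30500, 200),
  end   = c(1400, 13100, 31800, 950)
)

# Hypothetical detected gap/elevation region (positions in bp)
region <- list(seqid = "contigA", startPos = 10000, endPos = 25000)

# Keep predictions on the same contig that lie entirely inside the region
inRegion <- toyGenes$seqid == region$seqid &
  toyGenes$start >= region$startPos &
  toyGenes$end <= region$endPos
genesInRegion <- toyGenes[inRegion, ]
print(genesInRegion)
```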


Detect elevations and gaps in mapped read coverage patterns.

Description

Performs read coverage pattern-matching and summarizes the results into a list. The first list item summarizes the pattern-matching results. The second list item is the 'cleaned' version of the summary table, with all the 'noPattern' classifications removed. The third list item contains the pattern-match information needed for pattern-match visualization with 'plotProActiveResults()'. The fourth list item is a table containing all the contigs that were filtered out prior to pattern-matching. The fifth list item contains the arguments used during pattern-matching (windowSize, mode, chunkSize, chunkContigs). If the user provides a gffTSV file, the last list item is a table of the ORFs found within the detected gaps and elevations in read coverage.

Usage

ProActiveDetect(
  pileup,
  mode,
  gffTSV,
  windowSize = 1000,
  chunkContigs = FALSE,
  minSize = 10000,
  maxSize = Inf,
  minContigLength = 30000,
  chunkSize = 1e+05,
  IncludeNoPatterns = FALSE,
  verbose = TRUE,
  saveFilesTo
)

Arguments

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

mode

Either "genome" or "metagenome"

gffTSV

Optional, a .gff file (TSV) containing gene predictions associated with the .fasta file used to generate the pileup.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

chunkContigs

TRUE or FALSE. If TRUE and 'mode'="metagenome", contigs longer than the 'chunkSize' will be 'chunked' into smaller subsets and pattern-matching will be performed on each subset. Default is FALSE.

minSize

The minimum size (in bp) of elevation or gap patterns. Default is 10000.

maxSize

The maximum size (in bp) of elevation or gap patterns. Default is Inf (i.e. no maximum).

minContigLength

The minimum contig/chunk size (in bp) to perform pattern-matching on. Default is 30000.

chunkSize

If 'mode'="genome" OR if 'mode'="metagenome" and 'chunkContigs'=TRUE, chunk the genome or contigs, respectively, into smaller subsets for pattern-matching. 'chunkSize' determines the size (in bp) of each 'chunk'. Default is 100000.

IncludeNoPatterns

TRUE or FALSE, If TRUE the noPattern pattern-matches will be included in the ProActive PatternMatches output list. If you would like to visualize the noPattern pattern-matches in 'plotProActiveResults()', this should be set to TRUE.

verbose

TRUE or FALSE. Print progress messages to console. Default is TRUE.

saveFilesTo

Optional, Provide a path to the directory you wish to save output to. A folder will be made within the provided directory to store results.

Value

A list containing 6 objects described in the function description.

Examples

metagenome_results <- ProActiveDetect(
  pileup = sampleMetagenomePileup,
  mode = "metagenome",
  gffTSV = sampleMetagenomegffTSV
)

Change the pileup window size

Description

Re-averages the 100 bp windows of a pileup file over a larger window size to reduce the pileup size.

Usage

changewindowSize(pileupSubset, windowSize, mode)

Arguments

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

mode

Either "genome" or "metagenome"
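
As an illustration of what re-averaging 100 bp windows involves, here is a minimal base-R sketch. The helper name `reaverageWindows` and the column names `coverage` and `position` are hypothetical and are not taken from 'ProActive'; the package's own function may handle edge cases differently.

```r
# Hypothetical helper (not the package's code): average consecutive 100 bp
# windows into larger windows of `windowSize` bp.
reaverageWindows <- function(pileupSubset, windowSize = 1000) {
  stopifnot(windowSize %in% c(100, 200, 500, 1000))
  binsPerWindow <- windowSize / 100
  # Assign each 100 bp row to a larger window (1, 1, ..., 2, 2, ...)
  group <- ceiling(seq_len(nrow(pileupSubset)) / binsPerWindow)
  data.frame(
    coverage = as.numeric(tapply(pileupSubset$coverage, group, mean)),
    position = as.numeric(tapply(pileupSubset$position, group, min))
  )
}

# Toy pileup: 20 windows of 100 bp, ten with low and ten with high coverage
toyPileup <- data.frame(
  coverage = rep(c(10, 50), each = 10),
  position = seq(0, 1900, by = 100)
)
reaveraged <- reaverageWindows(toyPileup, windowSize = 1000)
print(reaveraged)
```

With a 1000 bp window, the 20 toy rows collapse to 2 rows, which is the size reduction the description refers to.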


Summarizes pattern-matching results

Description

Summarizes the list of pattern-matching classifications into a table.

Usage

classifSumm(pileup, bestMatchList, windowSize, mode, chunkSize)

Arguments

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

bestMatchList

A list containing pattern-match information associated with all contigs/chunks classified by 'ProActiveDetect()' pattern-matching.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

mode

Either "genome" or "metagenome"

chunkSize

If 'mode'="genome" OR if 'mode'="metagenome" and 'chunkContigs'=TRUE, chunk the genome or contigs, respectively, into smaller subsets for pattern-matching.


Collect information regarding the pattern-match

Description

Make a list containing the match-score, min and max pattern-match values, the start and stop positions of the elevated or gapped region, the elevation ratio and the classification

Usage

collectBestMatchInfo(pattern, pileupSubset, elevOrGap, leftRightFull)

Arguments

pattern

A vector containing the values associated with the pattern-match

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.

elevOrGap

Pattern-matching on 'elevation' or 'gap' pattern.

leftRightFull

'Left' or 'Right' partial gap/elevation pattern or full elevation/gap pattern.


'chunk' long contigs

Description

Subset long contigs in metagenome pileup into chunks for pattern-matching

Usage

contigChunks(pileup, chunkSize)

Arguments

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

chunkSize

If 'mode'="genome" OR if 'mode'="metagenome" and 'chunkContigs'=TRUE, chunk the genome or contigs, respectively, into smaller subsets for pattern-matching. 'chunkSize' determines the size (in bp) of each 'chunk'. Default is 100000.
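
Chunking can be pictured as splitting the rows of one contig into consecutive blocks of 'chunkSize' bp. The sketch below is a base-R illustration under the assumption of 100 bp pileup windows; the helper name `chunkContig`, the column names and the chunk naming scheme are hypothetical.

```r
# Hypothetical helper: split one contig's pileup rows into chunks of
# `chunkSize` bp. With 100 bp windows, each chunk holds chunkSize / 100 rows.
chunkContig <- function(contigPileup, chunkSize = 1e5, binSize = 100) {
  rowsPerChunk <- chunkSize / binSize
  chunkId <- ceiling(seq_len(nrow(contigPileup)) / rowsPerChunk)
  chunks <- split(contigPileup, chunkId)
  # Illustrative naming only; the package may label chunks differently
  names(chunks) <- paste0(contigPileup$contig[1], "_chunk_", names(chunks))
  chunks
}

# Toy contig of 250,000 bp (2,500 windows of 100 bp)
toyContig <- data.frame(
  contig = "contigA",
  coverage = rep(10, 2500),
  position = seq(0, by = 100, length.out = 2500)
)
toyChunks <- chunkContig(toyContig, chunkSize = 1e5)
print(sapply(toyChunks, nrow))
```

The last chunk is shorter than 'chunkSize' whenever the contig length is not an exact multiple of it.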


Classifies partial elevation/gap pattern-matches

Description

Classify the contig/chunk as 'gap' if the elevated region is less than 50% of the length of the contig/chunk and otherwise classify as 'elevation'.

Usage

elevOrGapClassif(bestMatchList, pileupSubset)

Arguments

bestMatchList

A list containing pattern-match information associated with all contigs/chunks classified by 'ProActiveDetect()' pattern-matching.

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.


Controller function for full elevation/gap pattern-matching

Description

Builds full elevation/gap pattern-matches, shrinks the width, and collects best match information

Usage

fullElevGap(pileupSubset, windowSize, minSize, maxSize, elevOrGap)

Arguments

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

minSize

The minimum size (in bp) of elevation or gap patterns. Default is 10000.

maxSize

The maximum size (in bp) of elevation or gap patterns. Default is Inf (i.e. no maximum).

elevOrGap

Pattern-matching on 'elevation' or 'gap' pattern.


Shrink the width of full elevation and gap patterns

Description

Remove values from gapped/elevated region in the pattern-match vector until it reaches the 'minSize'

Usage

fullElevGapShrink(
  minCov,
  windowSize,
  maxCov,
  elevLength,
  nonElev,
  bestMatchInfo,
  pileupSubset,
  minSize,
  elevOrGap
)

Arguments

minCov

The minimum value of the pattern-match vector.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

maxCov

The maximum value of the pattern-match vector.

elevLength

Length of the elevated/gapped pattern-match region.

nonElev

Length of the non-elevated/gapped pattern-match region.

bestMatchInfo

The information associated with the current best pattern-match for the contig/chunk being assessed.

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.

minSize

The minimum size (in bp) of elevation or gap patterns. Default is 10000.

elevOrGap

Pattern-matching on 'elevation' or 'gap' pattern.


Gene annotation plot

Description

Plot read coverage and location of gene annotations that match the keywords and search criteria for contig/chunk currently being assessed

Usage

geneAnnotationPlot(
  geneAnnotSubset,
  keywords,
  pileupSubset,
  colIdx,
  startbpRange,
  endbpRange,
  elevRatio,
  pattern,
  windowSize,
  chunkSize,
  mode
)

Arguments

geneAnnotSubset

Subset of gene annotations to be plotted

keywords

The key-word(s) used for the search.

pileupSubset

A subset of the pileup associated with the contig/chunk being assessed

colIdx

The column index of the 'gene' or 'product' column.

startbpRange

The basepair at which the search is started if a 'specific' search is used

endbpRange

The basepair at which the search is ended if a 'specific' search is used

elevRatio

The elevation ratio (maximum/minimum value) of the pattern-match.

pattern

The pattern-match information associated with the contig/chunk being assessed

windowSize

The number of basepairs to average read coverage values over.


Search for gene annotations on classified contigs/chunks

Description

Search contigs classified with ProActive for gene-annotations that match a provided key-word(s). Outputs read coverage plots for contigs/chunks with matching annotations.

Usage

geneAnnotationSearch(
  ProActiveResults,
  pileup,
  gffTSV,
  geneOrProduct,
  keyWords,
  inGapOrElev = FALSE,
  bpRange = 0,
  elevFilter,
  saveFilesTo,
  verbose = TRUE
)

Arguments

ProActiveResults

The output from 'ProActiveDetect()'.

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

gffTSV

A .gff file (TSV) containing gene predictions associated with the .fasta file used to generate the pileup.

geneOrProduct

"gene" or "product". Search for keyWords associated with genes or gene products.

keyWords

The keyWord(s) to search for. Case independent. Searches will return the string that contains the matching keyWord. KeyWord(s) must be in quotes, comma-separated, and surrounded by c(), e.g. c("antibiotic", "resistance", "drug").

inGapOrElev

TRUE or FALSE. If TRUE, only search for gene-annotations in the gap/elevation region of the pattern-match. Default is FALSE (i.e. search the entire contig/chunk for the gene annotation key-words).

bpRange

If 'inGapOrElev' = TRUE, the user may specify the region (in base pairs) that should be searched to the left and right of the gap/elevation region. Default is 0.

elevFilter

Optional, only plot results with pattern-matches that achieved an elevation ratio (max/min) greater than the specified value. Default is no filter.

saveFilesTo

Optional, Provide a path to the directory you wish to save output to. A folder will be made within the provided directory to store results.

verbose

TRUE or FALSE. Print progress messages to console. Default is TRUE.

Value

A list of ggplot objects.

Examples

geneAnnotMatches <- geneAnnotationSearch(sampleMetagenomeResults, sampleMetagenomePileup,
                                          sampleMetagenomegffTSV, geneOrProduct="product",
                                          keyWords=c("toxin", "drug", "resistance", "phage"))
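
The case-independent keyword matching described above can be illustrated without the package. The following base-R sketch assumes a toy table with the nine standard GFF columns and a `product=` field in the attributes column; it is an illustration of the idea, not the search logic 'geneAnnotationSearch()' actually uses.

```r
# Toy GFF-like table using the nine standard columns (V1..V9)
toyGff <- data.frame(
  V1 = c("contigA", "contigA", "contigB"),
  V2 = "Prodigal",
  V3 = "CDS",
  V4 = c(100, 2500, 300),
  V5 = c(900, 3400, 1200),
  V6 = ".",
  V7 = "+",
  V8 = 0,
  V9 = c("ID=a;product=Toxin HigB",
         "ID=b;product=hypothetical protein",
         "ID=c;gene=tetA;product=Tetracycline resistance protein")
)
keyWords <- c("toxin", "resistance")

# Pull the product field out of the attributes column (NA if absent)
product <- ifelse(grepl("product=", toyGff$V9),
                  sub(".*product=([^;]*).*", "\\1", toyGff$V9),
                  NA)

# Case-insensitive match against any of the keywords
hits <- grepl(paste(keyWords, collapse = "|"), product, ignore.case = TRUE)
matches <- toyGff[hits, ]
print(matches[, c("V1", "V4", "V5", "V9")])
```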

'chunk' genomes

Description

Subset genome pileup into chunks for pattern-matching

Usage

genomeChunks(pileup, chunkSize)

Arguments

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

chunkSize

If 'mode'="genome" OR if 'mode'="metagenome" and 'chunkContigs'=TRUE, chunk the genome or contigs, respectively, into smaller subsets for pattern-matching. 'chunkSize' determines the size (in bp) of each 'chunk'. Default is 100000.


Link pattern-matches on contig/genome chunks

Description

Detect partial gap/elevation pattern matches that fall on the edges of chunked genomes/contigs that may be part of the same pattern prior to chunking

Usage

linkChunks(bestMatchList, pileup, windowSize, mode, verbose)

Arguments

bestMatchList

A list containing pattern-match information associated with all contigs/chunks classified by 'ProActiveDetect()' pattern-matching.

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

windowSize

The number of basepairs to average read coverage values over.

mode

Either "genome" or "metagenome"

verbose

TRUE or FALSE. Print progress messages to console. Default is TRUE.


No read coverage pattern

Description

Assess whether a contig/chunk does not have an elevated/gapped read coverage pattern. A horizontal line at the mean or median coverage should be the best match if the contig/chunk read coverage is not gapped or elevated.

Usage

noPattern(pileupSubset)

Arguments

pileupSubset

A subset of the read coverage dataset that pertains only to the contig currently being assessed
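
A minimal sketch of the "horizontal line" idea: score a flat pattern at the mean and at the median coverage and keep whichever fits better. The mean absolute difference is used as the match-score, as in the pattern-matching controller's description; the helper names `meanAbsDiff` and `noPatternSketch` are hypothetical.

```r
# Match-score: mean absolute difference between a pattern and the coverage
meanAbsDiff <- function(pattern, coverage) mean(abs(pattern - coverage))

# Hypothetical sketch: compare flat lines at the mean and the median coverage
noPatternSketch <- function(coverage) {
  candidates <- c(mean = mean(coverage), median = median(coverage))
  scores <- sapply(candidates, function(level) {
    meanAbsDiff(rep(level, length(coverage)), coverage)
  })
  best <- names(which.min(scores))
  list(level = candidates[[best]], matchScore = scores[[best]], type = best)
}

flatCoverage <- c(10, 11, 9, 10, 10, 12, 8, 10)
flatResult <- noPatternSketch(flatCoverage)
print(flatResult)
```

For coverage without a gap or elevation, this flat-line score is expected to be lower than the score of any block-shaped pattern.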


Controller function for partial elevation/gap pattern-matching

Description

Builds partial elevation/gap pattern-match for patterns going off both the left and right sides of the contig/chunk, shrinks the width, and collects best match information

Usage

partialElevGap(pileupSubset, windowSize, minSize, maxSize)

Arguments

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

minSize

The minimum size (in bp) of elevation or gap patterns. Default is 10000.

maxSize

The maximum size (in bp) of elevation or gap patterns. Default is Inf (i.e. no maximum).


Shrink the width of partial elevation and gap patterns

Description

Remove values from gapped/elevated region in the pattern-match vector until it reaches the 'minSize'.

Usage

partialElevGapShrink(
  minCov,
  windowSize,
  maxCov,
  elevLength,
  nonElev,
  bestMatchInfo,
  pileupSubset,
  minSize,
  leftOrRight
)

Arguments

minCov

The minimum value of the pattern-match vector.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

maxCov

The maximum value of the pattern-match vector.

elevLength

Length of the elevated/gapped pattern-match region.

nonElev

Length of the non-elevated/gapped pattern-match region.

bestMatchInfo

The information associated with the current best pattern-match for the contig/chunk being assessed.

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.

minSize

The minimum size (in bp) of elevation or gap patterns. Default is 10000.

leftOrRight

'Left' or 'Right' partial gap/elevation pattern.
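
The shrinking step can be sketched as follows for a pattern running off the left edge of a contig/chunk. This is an assumption-laden illustration: the helper `shrinkLeftBlock`, the one-window shrink step and the use of the minimum and maximum coverage as pattern levels are choices made for the example, not details confirmed by the package.

```r
# Match-score: mean absolute difference between a pattern and the coverage
meanAbsDiff <- function(pattern, coverage) mean(abs(pattern - coverage))

# Hypothetical sketch: start with a wide elevated block on the left edge and
# narrow it one window at a time until it reaches `minSize`, scoring each width.
shrinkLeftBlock <- function(coverage, startLen, minSize = 10000,
                            windowSize = 1000) {
  minLen <- minSize / windowSize          # smallest allowed block, in windows
  low <- min(coverage)
  high <- max(coverage)
  lens <- seq(startLen, minLen, by = -1)  # candidate block widths
  scores <- sapply(lens, function(len) {
    pattern <- c(rep(high, len), rep(low, length(coverage) - len))
    meanAbsDiff(pattern, coverage)
  })
  best <- which.min(scores)
  list(elevLength = lens[best], matchScore = scores[best])
}

# Toy coverage: 50 windows of 1000 bp, elevated over the first 18 windows
coverage <- c(rep(40, 18), rep(10, 32))
shrunk <- shrinkLeftBlock(coverage, startLen = 45)
print(shrunk)
```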


Builds pattern-match vectors

Description

Builds the pattern-match (vector) associated with each contig/chunk for visualization.

Usage

patternBuilder(pileupSubset, bestMatchInfo)

Arguments

pileupSubset

A subset of the pileup that pertains only to the contig/chunk currently being assessed.

bestMatchInfo

The information associated with the current best pattern-match for the contig/chunk being assessed.


Controller function for pattern-matching

Description

Creates the pileupSubset, representative of one contig/chunk, used as input for each individual pattern-matching function. After the information associated with the best match for each pattern is obtained, the pattern-match with the lowest mean absolute difference (match-score) is used for classification.

Usage

patternMatcher(
  pileup,
  windowSize,
  minSize,
  maxSize,
  mode,
  minContigLength,
  verbose
)

Arguments

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

windowSize

The number of basepairs to average read coverage values over.

minSize

The minimum size (in bp) of elevation or gap patterns. Default is 10000.

maxSize

The maximum size (in bp) of elevation or gap patterns. Default is Inf (i.e. no maximum).

mode

Either "genome" or "metagenome".

minContigLength

The minimum contig/chunk size (in bp) to perform pattern-matching on. Default is 30000.

verbose

TRUE or FALSE. Print progress messages to console. Default is TRUE.
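
The selection rule in the description (lowest mean absolute difference wins) can be illustrated with a small base-R sketch. The candidate patterns below are hand-made toy vectors; in the package they would come from the individual pattern-matching functions, whose outputs are not reproduced here.

```r
# Match-score: mean absolute difference between a pattern and the coverage
meanAbsDiff <- function(pattern, coverage) mean(abs(pattern - coverage))

# Toy coverage for one contig/chunk: an elevated block in the middle
coverage <- c(rep(10, 20), rep(40, 15), rep(10, 15))

# Hypothetical best candidate from each pattern type
candidates <- list(
  noPattern     = rep(mean(coverage), length(coverage)),
  fullElevation = c(rep(10, 20), rep(40, 15), rep(10, 15)),
  partialLeft   = c(rep(40, 25), rep(10, 25))
)

# Score every candidate and classify by the lowest match-score
matchScores <- sapply(candidates, meanAbsDiff, coverage = coverage)
bestMatch <- names(which.min(matchScores))
print(matchScores)
print(bestMatch)
```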


Full elevation/gap pattern translator

Description

Translates full elevation/gap patterns across contigs/chunks 1000bp at a time. Translation stops when the elevation pattern is 5000bp from the end of the contig/chunk.

Usage

patternTranslator(contigCov, bestMatchInfo, windowSize, pattern, elevOrGap)

Arguments

contigCov

The read coverages that pertain to the pileupSubset

bestMatchInfo

The information associated with the current best pattern-match for the contig/chunk being assessed.

windowSize

The number of basepairs to average read coverage values over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.

pattern

A vector containing the values associated with the pattern-match

elevOrGap

Pattern-matching on 'elevation' or 'gap' pattern.
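
Translation of a full pattern can be sketched as sliding a fixed-width block across the coverage vector in 1000 bp steps and scoring each position. The sketch below keeps a 5000 bp margin at both ends, which is an assumption (the description only states the stop condition at the end of the contig/chunk); the helper `translateBlock` is hypothetical.

```r
# Match-score: mean absolute difference between a pattern and the coverage
meanAbsDiff <- function(pattern, coverage) mean(abs(pattern - coverage))

# Hypothetical sketch: slide an elevated block of `blockLen` windows across the
# contig, 1000 bp at a time, staying 5000 bp away from both ends.
translateBlock <- function(coverage, blockLen, windowSize = 1000) {
  step <- 1000 / windowSize    # windows per 1000 bp step
  margin <- 5000 / windowSize  # windows per 5000 bp margin
  low <- min(coverage)
  high <- max(coverage)
  starts <- seq(margin + 1, length(coverage) - margin - blockLen + 1, by = step)
  scores <- sapply(starts, function(s) {
    pattern <- rep(low, length(coverage))
    pattern[s:(s + blockLen - 1)] <- high
    meanAbsDiff(pattern, coverage)
  })
  best <- which.min(scores)
  list(startWindow = starts[best], matchScore = scores[best])
}

# Toy coverage: 50 windows of 1000 bp, elevated over windows 21 to 35
coverage <- c(rep(10, 20), rep(40, 15), rep(10, 15))
translated <- translateBlock(coverage, blockLen = 15)
print(translated)
```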


Reformat input pileup file

Description

Place columns in correct order, clean accessions by removing text after white space, and name columns

Usage

pileupFormatter(pileup, mode)

Arguments

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

mode

Either "genome" or "metagenome"
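
The accession clean-up can be illustrated in base R. The toy pileup, the three-column layout and the column names `contigName`, `coverage` and `position` are assumptions for the example; the real function's column order and names are not shown in this manual.

```r
# Toy raw pileup with verbose accessions, as some assemblers produce
rawPileup <- data.frame(
  V1 = c("contigA length=5000 cov=12.3", "contigA length=5000 cov=12.3"),
  V2 = c(11.2, 13.4),
  V3 = c(0, 100)
)

formatted <- rawPileup
# Name the columns (hypothetical names)
colnames(formatted) <- c("contigName", "coverage", "position")
# Remove everything after the first white space in the accession
formatted$contigName <- sub("\\s.*$", "", formatted$contigName)
print(formatted)
```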


Plot results of 'ProActiveDetect()' pattern-matching

Description

Plot read coverage of contigs/chunks with detected gaps and elevations and their associated pattern-match.

Usage

plotProActiveResults(pileup, ProActiveResults, elevFilter, saveFilesTo)

Arguments

pileup

A .txt file containing mapped sequencing read coverages averaged over 100 bp windows/bins.

ProActiveResults

The output from 'ProActiveDetect()'.

elevFilter

Optional, only plot results with pattern-matches that achieved an elevation ratio (max/min) greater than the specified value. Default is no filter.

saveFilesTo

Optional, Provide a path to the directory you wish to save output to. A folder will be made within the provided directory to store results.

Value

A list containing ggplot objects

Examples

ProActivePlots <- plotProActiveResults(sampleMetagenomePileup,
                                       sampleMetagenomeResults)
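
The effect of 'elevFilter' can be pictured as a simple threshold on the elevation ratio (max/min) of each pattern-match. The table below is toy data with hypothetical column names, not the structure of the 'ProActiveDetect()' output.

```r
# Toy pattern-match summaries; names and fields are illustrative only
toyMatches <- data.frame(
  contig = c("contigA", "contigB", "contigC"),
  maxCov = c(80, 30, 12),
  minCov = c(10, 20, 10)
)

# Elevation ratio = maximum / minimum value of the pattern-match
toyMatches$elevRatio <- toyMatches$maxCov / toyMatches$minCov

# Keep only matches whose ratio exceeds the filter
elevFilter <- 1.4
toPlot <- toyMatches[toyMatches$elevRatio > elevFilter, ]
print(toPlot)
```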

Removes 'NoPattern' classifications from best match list

Description

Removes 'NoPattern' classifications from the list of pattern-match information associated with the best pattern-matches for each contig/chunk

Usage

removeNoPatterns(bestMatchList)

Arguments

bestMatchList

A list containing pattern-match information associated with all contigs/chunks classified by 'ProActiveDetect()' pattern-matching.


sampleGenomePileup

Description

A pileup file generated during read mapping to the *Salmonella enterica* LT2 genome.

Usage

sampleGenomePileup

Format

A data frame with 48,575 rows and 4 columns:

V1

Accession

V2

Mapped read coverage averaged over a 100 bp window size

V3

Starting position (bp) of each 100 bp window. Starts from 100.

V4

Starting position (bp) of each 100 bp window. Starts from 0.

Details

This dataset was generated by extracting DNA from a culture of *Salmonella enterica* LT2 (LT2) infected with phage P22. The DNA was shotgun sequenced with Illumina (paired-end mode, 150 bp reads). The sequencing reads were mapped to the LT2 reference genome (NCBI RefSeq NC_003197.2). The bbmap.sh bincov parameter with covbinsize=100 was used to create a pileup file with 100 bp windows.

Source

<https://pubmed.ncbi.nlm.nih.gov/25608871/>


sampleGenomegffTSV

Description

Gene annotations associated with the genome in the sampleGenomePileup.

Usage

sampleGenomegffTSV

Format

A data frame with 85,575 rows and 9 columns:

V1

seqid

V2

source

V3

type

V4

start

V5

end

V6

score

V7

strand

V8

phase

V9

attributes

Details

This is a standard .gff file format. The .gff file was generated by running PROKKA with default parameters on the *Salmonella enterica* LT2 genome sequence (NCBI RefSeq NC_003197.2) associated with the sampleGenomePileup in the ProActive package.


sampleMetagenomePileup

Description

A subset of contigs from the raw whole-community fraction read coverage pileup file generated during read mapping.

Usage

sampleMetagenomePileup

Format

A data frame with 4,604 rows and 4 columns:

V1

Contig accession

V2

Mapped read coverage averaged over a 100 bp window size

V3

Starting position (bp) of each 100 bp window. Restarts from 0 at the start of each new contig.

V4

Starting position (bp) of each 100 bp window. Does NOT restart at the start of each new contig.

Details

This dataset was generated from a conventional mouse fecal homogenate. The whole-community extracted DNA was sequenced with Illumina (paired-end mode, 150 bp reads), after which the metagenome was assembled. The sequencing reads were mapped to the assembled contigs using BBMap. The bbmap.sh bincov parameter with covbinsize=100 was used to create a pileup file with 100 bp windows. A subset of 10 contigs from the pileup file was selected for this sample dataset. The contigs were chosen because their associated read coverage patterns exemplify ProActive's pattern-matching and characterization functionality across classifications:

  NODE_1911: elevation off left

  NODE_1583: elevation off right

  NODE_1884: gap off right

  NODE_1255: gap off left

  NODE_368: full gap

  NODE_617: elevation full

  NODE_1625: no pattern

Source

<https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00935-5>


sampleMetagenomeResults

Description

Output of 'ProActiveDetect()'.

Usage

sampleMetagenomeResults

Format

A list with 6 objects:

SummaryTable

A table containing all pattern-matching classifications

CleanSummaryTable

A table containing only gap and elevation pattern-match classifications (i.e. noPattern classifications removed)

PatternMatches

A list object containing information needed to visualize the pattern-matches in 'plotProActiveResults()'

FilteredOut

A table containing contigs/chunks that were filtered out for being too small or having too low read coverage

Arguments

A list object containing arguments used for pattern-matching (windowSize, mode, chunkSize, chunkContigs)

GeneAnnotTable

A table containing gene predictions associated with elevated or gapped regions in pattern-matches

Details

This data was generated by running 'ProActiveDetect()' on the sampleMetagenomePileup and sampleMetagenomegffTSV with default parameters.


sampleMetagenomegffTSV

Description

A subset of gene annotations associated with the metagenome in the sampleMetagenomePileup.

Usage

sampleMetagenomegffTSV

Format

A data frame with 467 rows and 9 columns:

V1

seqid

V2

source

V3

type

V4

start

V5

end

V6

score

V7

strand

V8

phase

V9

attributes

Details

This is a standard .gff file format. The .gff file was generated by running PROKKA with default parameters on the metagenome assembly associated with the sampleMetagenomePileup in the ProActive package. The gff was subset to only include the data associated with the contigs in the sample data subset.

Source

<https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00935-5>