Review, Evaluation, and Directions for Gene-Targeted Assembly for Ecological Analyses of Metagenomes

Shotgun metagenomics has greatly advanced our understanding of microbial communities over the last decade. Metagenomic analyses often include assembly and genome binning, which are computationally daunting tasks, especially for large datasets from complex environments such as soil and sediments. In many studies, however, only a subset of genes and pathways involved in specific functions is of interest, so global assembly is unnecessary. In addition, methods that target genes can be computationally more efficient and produce more accurate assemblies by leveraging rich reference databases, especially for genes of broad interest such as those involved in biogeochemical cycles, biodegradation, and antibiotic resistance, or those used as phylogenetic markers. Here, we review six gene-targeted assemblers with unique algorithms for extracting and/or assembling targeted genes: Xander, MegaGTA, SAT-Assembler, HMM-GRASPx, GenSeed-HMM, and MEGAN. We tested these tools on three datasets: two with known genomes (a synthetic community of artificial reads derived from the genomes of 17 bacteria, and shotgun sequence data from a mock community with 48 bacterial and 16 archaeal genomes) and a large soil shotgun metagenomic dataset. We compared assemblies of a universal single-copy gene (rplB) and two nitrogen cycle genes (nifH and nirK). We measured the tools' computational efficiency, sensitivity, specificity, and chimera rate. Xander and MegaGTA, which both use a probabilistic graph structure to model the genes, had the best overall performance on all three datasets, although MEGAN, a reference-matching assembler, had better sensitivity for synthetic and mock community members chosen from its reference collection. Xander and MegaGTA are also the only tools that include post-assembly scripts tuned for common molecular ecology and diversity analyses. Additionally, we provide a mathematical model that estimates the probability of assembling targeted genes in a metagenome, which can be used to estimate the required sequencing depth.
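As a rough illustration of how sequencing depth relates to the chance of recovering a target gene, the sketch below uses a textbook Poisson coverage approximation. It is not the model developed in this paper; the function names, parameters, and example values are all assumptions introduced for illustration only.

```python
import math


def per_base_coverage(total_bases, rel_abundance, genome_size):
    """Expected per-base coverage of one target genome.

    Assumes reads are sampled uniformly and that `rel_abundance` is the
    fraction of sequenced bases that come from the target genome.
    """
    return total_bases * rel_abundance / genome_size


def prob_base_covered(coverage, min_depth=1):
    """Poisson probability that a base is covered by at least `min_depth` reads."""
    p_below = sum(
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - p_below


def required_bases(target_prob, rel_abundance, genome_size):
    """Sequencing effort (in bases) so that a base of the target genome is
    covered at least once with probability `target_prob`."""
    coverage = -math.log(1.0 - target_prob)  # invert 1 - exp(-c)
    return coverage * genome_size / rel_abundance


if __name__ == "__main__":
    # Hypothetical example: a 5 Mbp genome at 0.01% relative abundance.
    effort = required_bases(0.99, rel_abundance=1e-4, genome_size=5e6)
    print(f"bases needed for 99% per-base coverage: {effort:.3g}")
```

Under these assumptions, a 5 Mbp genome at a relative abundance of 1e-4 needs on the order of 2.3e11 sequenced bases for each of its positions to have a 99% chance of being covered at least once. Assembling a full-length gene requires contiguous coverage at a depth the assembler can use, so this figure should be read as a lower bound rather than a planning target.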
