BLAST Tree: Fast Filtering for Genomic Sequence Classification

With the advent of next-generation sequencing and culture-independent methods, we are now accumulating an enormous amount of metagenomic data from microbial communities. These data sets are large, hard to assemble, and may encode rare or novel proteins, posing new computational challenges for protein homology search. This paper presents a novel protein homology search algorithm that combines the salient features of pairwise sequence alignment programs such as Blast and protein-family-based tools such as Hmmer. It is optimized for protein annotation in metagenomic data sets because: 1) it is fast, 2) it can classify short protein fragments encoded by individual sequence reads, and 3) it can find homologs to novel or rare protein families when there are not enough member sequences to build a probabilistic model. Our algorithm builds a new indexing data structure called BlastTree, which can index a large sequence family database because of our effective compression techniques. In addition, BlastTree fully exploits sequence family membership information to improve homology search sensitivity. When the BlastTree Search algorithm is incorporated into Hmmer, the search runs in a fraction of the original time with comparable result quality.
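The abstract does not describe the internals of BlastTree, so the following is only a generic sketch of the broader idea it names: indexing a sequence family database and using family membership to filter candidates before a more expensive search. It is not the authors' data structure and includes no compression. The function names, the k-mer length, the hit threshold, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_family_index(families, k=3):
    """Map each k-mer to the set of family names whose members contain it.

    Hypothetical helper: a plain inverted index, not the BlastTree structure.
    """
    index = defaultdict(set)
    for name, members in families.items():
        for seq in members:
            for word in kmers(seq, k):
                index[word].add(name)
    return index


def candidate_families(index, query, k=3, min_hits=2):
    """Return families sharing at least min_hits distinct k-mers with the query."""
    hits = defaultdict(int)
    for word in set(kmers(query, k)):
        for name in index.get(word, ()):
            hits[name] += 1
    return {name: n for name, n in hits.items() if n >= min_hits}


# Toy, made-up sequences; these are not real protein families.
families = {
    "famA": ["MKVLAAGIV", "MKVLSAGIV"],
    "famB": ["GHWTEPLLQ"],
}
index = build_family_index(families)
result = candidate_families(index, "KVLAAG")
print(result)
```

In a pipeline of this shape, only the families that survive the filter would be passed to a slower, more sensitive scorer such as a profile HMM search. Pooling k-mers across all members of a family is one simple way that family membership, rather than a single representative sequence, can inform the filter.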
