Machine learning for metagenomics: methods and tools

Owing to the complexity and variability of metagenomic studies, modern machine learning approaches are increasingly used to answer a variety of questions spanning the full range of metagenomic NGS data analysis. We review here the contribution of machine learning techniques to the field of metagenomics by presenting known successful approaches in a unified framework. This review focuses on five important metagenomic problems: OTU clustering, binning, taxonomic profiling and assignment, comparative metagenomics, and gene prediction. For each of these problems, we identify the most prominent methods, summarize the machine learning approaches used, and put them into perspective with similar methods. We conclude by looking further ahead at the challenge posed by the analysis of interactions within microbial communities and different environments, in a field one could call "integrative metagenomics".
