Reading the Underlying Information From Massive Metagenomic Sequencing Data

Microorganisms are everywhere. Recent studies have shown that the microbiome, the mixture of microbes on the human body, plays important roles in human physiology and disease. Metagenomic sequencing is a key technology for studying microbiomes. It produces massive amounts of data in the form of short sequencing reads: in a typical shotgun metagenomic sequencing study, a single sample can contain 10^7 to 10^8 reads, each about 100 nucleotides (nt) long. These reads contain rich information about microbiomes and their functions, but reading out that information from such huge, highly fragmented data poses multiple challenges for mathematical models, bioinformatics methods, and computer algorithms. In this paper, we review the basic bioinformatics tasks and existing methods for processing and analyzing metagenomic data, and discuss remaining open challenges and practical observations. The aim of the paper is to give readers an overall picture of metagenomic data processing and analysis, and to offer a reference and a starting perspective for computational scientists who are interested in this exciting field.
