Paper information - Reading the Underlying Information From Massive Metagenomic Sequencing Data

Microorganisms are everywhere. Recent studies have shown that the mixture of microbes on the human body, known as the microbiome, plays important roles in human physiology and disease. Metagenomic sequencing is a key technology for studying microbiomes. It produces massive amounts of data in the form of short sequencing reads. In a typical shotgun metagenomic sequencing study, a single sample can contain 10^7 to 10^8 reads, each about 100 nucleotides (nt) long. These reads contain rich information about microbiomes and their functions, but reading out that information from such huge, highly fragmented data poses multiple challenges for mathematical models, bioinformatics methods, and computer algorithms. In this paper, we review the basic bioinformatics tasks and existing methods for processing and analyzing metagenomic data, and discuss remaining open challenges and practical observations. The aim of the paper is to give readers a complete picture of metagenomic data processing and analysis, and to offer a reference and a starting perspective for computational scientists who are interested in this exciting field.
