Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms

Title of dissertation: Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms

Sergey Koren, Doctor of Philosophy, 2012

Dissertation directed by: Professor Mihai Pop, Department of Computer Science

Genome assembly is a critical first step for biological discovery. All current sequencing technologies share the fundamental limitation that segments read from a genome are much shorter than even the smallest genomes. Traditionally, whole-genome shotgun (WGS) sequencing over-samples a single clonal (or inbred) target chromosome with segments from random positions. The amount of over-sampling is known as the coverage. Assembly software then reconstructs the target.

So-called next-generation (or second-generation) sequencing has reduced the cost and increased throughput exponentially over first-generation sequencing. Unfortunately, next-generation sequences present their own challenges to genome assembly: (1) they require amplification of source DNA prior to sequencing, leading to artifacts and biased coverage of the genome; (2) they produce relatively short reads: 100bp–700bp; (3) the sizeable runtime of most second-generation instruments is prohibitive for applications requiring rapid analysis, with an Illumina HiSeq 2000 instrument requiring 11 days for the sequencing reaction.

Recently, successors to the second-generation instruments (third-generation) have become available. These instruments promise to alleviate many of the downsides of second-generation sequencing and can generate multi-kilobase sequences. The long sequences have the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of these reads is challenging and has limited their use. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the error in single-molecule sequences.
Our approach achieves over 99% read accuracy and produces substantially better assemblies than current sequencing strategies.

The availability of cheaper sequencing has made new sequencing targets, such as multiple displacement amplified (MDA) single cells and metagenomes, popular. Current algorithms assume assembly of a single clonal target, an assumption that is violated in these sequencing projects. We developed Bambus 2, a new scaffolder that works for metagenomic and single-cell datasets. It can accurately detect repeats without assumptions about the taxonomic composition of a dataset. It can also identify biological variations present in a sample.

We have developed a novel end-to-end analysis pipeline leveraging Bambus 2. Due to its modular nature, it is applicable to clonal, metagenomic, and MDA single-cell targets and allows a user to rapidly go from sequences to assembly, annotation, genes, and taxonomic information. We have incorporated a novel viewer, allowing a user to interactively explore the variation present in a genomic project on a laptop.

Together, these developments make genome assembly applicable to novel targets while utilizing emerging sequencing technologies. As genome assembly is critical for all aspects of bioinformatics, these developments will enable novel biological discovery.
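
The coverage mentioned in the abstract is conventionally computed as the total number of sequenced bases divided by the genome length. The sketch below illustrates that arithmetic only; the function names and the example numbers are hypothetical and do not come from the dissertation, and the uncovered-fraction estimate assumes the standard idealized model in which reads start at uniformly random positions.

```python
import math


def coverage(num_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Depth of coverage c = N * L / G (hypothetical helper, for illustration)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


def reads_needed(target_coverage: float, mean_read_length: float, genome_size: int) -> int:
    """Smallest number of reads that reaches the target coverage."""
    if mean_read_length <= 0:
        raise ValueError("mean_read_length must be positive")
    return math.ceil(target_coverage * genome_size / mean_read_length)


def expected_fraction_uncovered(c: float) -> float:
    """Approximate fraction of bases with no read over them, exp(-c).

    Assumption: idealized uniform random sampling (Poisson approximation),
    which the amplification bias described in the abstract would violate.
    """
    return math.exp(-c)


if __name__ == "__main__":
    # Hypothetical example: 1,000,000 reads of 100 bp over a 5 Mbp genome.
    c = coverage(1_000_000, 100, 5_000_000)
    print(f"coverage: {c:.1f}x")
    print(f"reads for 20x: {reads_needed(20, 100, 5_000_000)}")
    print(f"expected uncovered fraction: {expected_fraction_uncovered(c):.2e}")
```

Under this idealized model, longer reads reach the same coverage with fewer reads, but the uniform-sampling assumption is exactly what biased amplification breaks, so the uncovered-fraction figure should be read as an optimistic estimate.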
