Recovery of gene haplotypes from a metagenome

The population-level diversity of natural microbiomes represents a biotechnological resource for biomining, biorefining and synthetic biology, but exploiting it requires recovering the exact DNA sequence (or “haplotype”) of the genes and genomes of every individual present. Computational haplotype reconstruction is extremely difficult, and is further complicated in environmental sequencing data (metagenomics). Current approaches cannot choose between alternative haplotype reconstructions and fail to provide biological evidence of correct predictions. To overcome this, we present Hansel and Gretel: a novel probabilistic framework that reconstructs the most likely haplotypes from complex microbiomes, is robust to sequencing error and uses all available evidence from aligned reads, without altering or discarding observed variation. We provide the first formalisation of this problem and propose “metahaplome” as a definition for the set of haplotypes for any genomic region of interest within a metagenomic dataset. Finally, using long-read sequencing, we demonstrate biological evidence of novel haplotypes of industrially important enzymes that were computationally predicted from a natural microbiome.
