Haplotype-aware diplotyping from noisy long reads

Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, a task we term diplotyping. Our technique takes full advantage of the linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.
