Platanus-allee is a de novo haplotype assembler enabling a comprehensive access to divergent heterozygous regions

The ultimate goal of diploid genome determination is to decode each homologous chromosome completely and independently, and several programs that phase haplotypes from consensus sequences have been developed. These methods work well for genomes with low heterozygosity, but many species have highly heterozygous genomes. Additionally, there are highly divergent regions (HDRs), where the haplotype sequences differ considerably. Because HDRs are likely to underlie various interesting biological phenomena, many targets of genomic analysis fall within these regions. However, HDRs cannot be accessed by existing phasing methods, so costly traditional methods must be used instead. Here, we develop a de novo haplotype assembler, Platanus-allee (http://platanus.bio.titech.ac.jp/platanus2), which initially constructs each haplotype sequence and then untangles the assembly graphs utilizing sequence links and synteny information. A comprehensive benchmark analysis reveals that Platanus-allee exhibits high recall and precision, particularly for HDRs. Using this approach, previously unknown HDRs are detected in the human genome, which may uncover novel aspects of genome variability.

Most phasing programmes for sequencing data work well for genomes with low heterozygosity, but their performance drops in regions of high heterozygosity. Here, Kajitani et al. develop the assembler Platanus-allee and demonstrate its utility in de novo assemblies of various genomes and the human MHC region.
