HySA: A Hybrid Structural variant Assembly approach using next generation and single-molecule sequencing technologies

Achieving complete, accurate, and cost-effective assembly of human genomes is of great importance for realizing the promise of precision medicine. The abundance of repeats and genetic variations in human genomes and the limitations of existing sequencing technologies call for the development of novel assembly methods that can leverage the complementary strengths of multiple technologies. We propose a Hybrid Structural variant Assembly (HySA) approach that integrates sequencing reads from next-generation sequencing and single-molecule sequencing technologies to accurately assemble and detect structural variants (SVs) in human genomes. By identifying homologous SV-containing reads from different technologies through a bipartite-graph-based clustering algorithm, our approach turns a whole genome assembly problem into a set of independent SV assembly problems, each of which can be effectively solved to enhance the assembly of structurally altered regions in human genomes. We used data generated from a haploid hydatidiform mole genome (CHM1) and a diploid human genome (NA12878) to test our approach. The result showed that, compared with existing methods, our approach had a low false discovery rate and substantially improved the detection of many types of SVs, particularly novel large insertions, small indels (10-50 bp), and short tandem repeat expansions and contractions. Our work highlights the strengths and limitations of current approaches and provides an effective solution for extending the power of existing sequencing technologies for SV discovery.

[1]  W. Wong,et al.  Improving PacBio Long Read Accuracy by Short Read Alignment , 2012, PloS one.

[2]  Eugene W. Myers,et al.  The fragment assembly string graph , 2005, ECCB/JBI.

[3]  G. Weinstock,et al.  TIGRA: A targeted iterative graph routing assembler for breakpoint assembly , 2014, Genome research.

[4]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[5]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[6]  M E J Newman,et al.  Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[7]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[8]  Xun Xu,et al.  SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads , 2013, Bioinform..

[9]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[10]  Kenny Q. Ye,et al.  Mapping copy number variation by population scale genome sequencing , 2010, Nature.

[11]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[12]  T. Wetter,et al.  Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. , 2004, Genome research.

[13]  G. McVean,et al.  A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree , 2016, bioRxiv.

[14]  Kiyoshi Asai,et al.  PBSIM: PacBio reads simulator - toward accurate genome assembly , 2013, Bioinform..

[15]  Wanding Zhou,et al.  Convergent evolution of modularity in metabolic networks through different community structures , 2012, BMC Evolutionary Biology.

[16]  Martin Rosvall,et al.  Maps of random walks on complex networks reveal community structure , 2007, Proceedings of the National Academy of Sciences.

[17]  Russell E. Durrett,et al.  Assembly and diploid architecture of an individual human genome via single-molecule technologies , 2015, Nature Methods.

[18]  Thomas Hackl,et al.  proovread: large-scale high-accuracy PacBio correction through iterative short read consensus , 2014, Bioinform..

[19]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[20]  Glenn Tesler,et al.  Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory , 2012, BMC Bioinformatics.

[21]  Thomas Zichner,et al.  DELLY: structural variant discovery by integrated paired-end and split-read analysis , 2012, Bioinform..

[22]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.

[23]  Sergey A. Shiryev,et al.  Single haplotype assembly of the human genome from a hydatidiform mole , 2014, bioRxiv.

[24]  P. Gács,et al.  Algorithms , 1992 .

[25]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[26]  Yong-shu He,et al.  [Structural variation in the human genome]. , 2009, Yi chuan = Hereditas.

[27]  M. Schatz,et al.  Phased diploid genome assembly with single-molecule real-time sequencing , 2016, Nature Methods.

[28]  M. Schatz,et al.  Phased diploid genome assembly with single-molecule real-time sequencing , 2016, Nature Methods.

[29]  James R. Knight,et al.  Genome sequencing in microfabricated high-density picolitre reactors , 2005, Nature.

[30]  Kai Ye,et al.  Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads , 2009, Bioinform..

[31]  S. Koren,et al.  Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation , 2016, bioRxiv.

[32]  R. Wilson,et al.  BreakDancer: An algorithm for high resolution mapping of genomic structural variation , 2009, Nature Methods.

[33]  Ryan M. Layer,et al.  LUMPY: a probabilistic framework for structural variant discovery , 2012, Genome Biology.

[34]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[35]  E. Eichler,et al.  Limitations of next-generation genome sequence assembly , 2011, Nature Methods.

[36]  Mark Gerstein,et al.  Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants , 2009, PLoS Comput. Biol..

[37]  M. Schatz,et al.  Hybrid error correction and de novo assembly of single-molecule sequencing reads , 2012, Nature Biotechnology.

[38]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[39]  Bradley P. Coe,et al.  Genome structural variation discovery and genotyping , 2011, Nature Reviews Genetics.

[40]  Michael C. Rusch,et al.  CREST maps somatic structural variation in cancer genomes with base-pair resolution , 2011, Nature Methods.

[41]  Adam C. English,et al.  PBHoney: identifying genomic variants via long-read discordance and interrupted mapping , 2014, BMC Bioinformatics.

[42]  Mark J. P. Chaisson,et al.  Resolving the complexity of the human genome using single-molecule sequencing , 2014, Nature.

[43]  David T. W. Jones,et al.  Signatures of mutational processes in human cancer , 2013, Nature.

[44]  Eugene W. Myers,et al.  A whole-genome assembly of Drosophila. , 2000, Science.

[45]  David Hsu,et al.  Characterization of structural variants with single molecule and hybrid sequencing approaches , 2014, Bioinform..

[46]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[47]  Joshua M. Korn,et al.  Mapping and sequencing of structural variation from eight human genomes , 2008, Nature.

[48]  Evan E. Eichler,et al.  Genetic variation and the de novo assembly of human genomes , 2015, Nature Reviews Genetics.

[49]  N. Lennon,et al.  Characterizing and measuring bias in sequence data , 2013, Genome Biology.

[50]  Benjamin J. Raphael,et al.  An integrative probabilistic model for identification of structural variation in sequencing data , 2012, Genome Biology.

[51]  C. Nusbaum,et al.  ALLPATHS: de novo assembly of whole-genome shotgun microreads. , 2008, Genome research.

[52]  Life Technologies,et al.  A map of human genome variation from population-scale sequencing , 2011 .

[53]  Wolfgang Losert,et al.  svclassify: a method to establish benchmark structural variant calls , 2015, BMC Genomics.

[54]  L. Ding,et al.  novoBreak: local assembly for breakpoint detection in cancer genomes , 2016, Nature Methods.

[55]  Elizabeth M. Smigielski,et al.  dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[56]  J. Zook,et al.  Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls , 2013, Nature Biotechnology.