HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution

Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce misassemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding "hinges" to reads and then constructing an overlap graph in which only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with the repeat-resolution capabilities of de Bruijn graph assemblers. HINGE was evaluated on the long-read bacterial data sets from the NCTC project. HINGE produces more finished assemblies than Miniasm and than the manual NCTC pipeline, which is based on the HGAP assembler and Circlator. HINGE also allows us to identify 40 data sets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, while other approaches such as the NCTC pipeline and FALCON either fragment the assembly or resolve the ambiguity arbitrarily.
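The distinction between resolvable and unresolvable repeats can be illustrated with a toy sketch. It assumes, purely for illustration, that repeat and read positions are known in genome coordinates and that a repeat copy counts as resolvable when at least one read spans it with flanking sequence on both sides. HINGE itself works from read overlaps without a reference, and its actual criteria are more involved; the names `is_bridged`, `classify_repeats`, and `margin` are hypothetical and are not part of HINGE.

```python
from typing import Dict, List, Tuple

# Half-open interval [start, end) in (assumed known) genome coordinates.
Interval = Tuple[int, int]


def is_bridged(repeat: Interval, reads: List[Interval], margin: int = 1) -> bool:
    """Return True if some read covers the repeat copy plus `margin`
    bases of flanking sequence on each side (illustrative criterion)."""
    r_start, r_end = repeat
    return any(
        start <= r_start - margin and end >= r_end + margin
        for start, end in reads
    )


def classify_repeats(
    repeats: List[Interval], reads: List[Interval], margin: int = 1
) -> Dict[Interval, str]:
    """Label each repeat copy as 'resolvable' or 'unresolvable'
    under the toy bridging criterion above."""
    return {
        rep: "resolvable" if is_bridged(rep, reads, margin) else "unresolvable"
        for rep in repeats
    }


# Hypothetical example data: three reads and two repeat copies.
reads = [(0, 100), (50, 200), (180, 300)]
repeats = [(60, 90), (90, 210)]
labels = classify_repeats(repeats, reads)
```

In this example the short repeat at (60, 90) lies inside the first read with flanking bases on both sides, so it is labeled resolvable, while the long repeat at (90, 210) is not spanned by any single read and is labeled unresolvable.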
