HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution

Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce mis-assemblies, while conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding "hinges" to reads and using them to construct an overlap graph in which only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with the repeat-resolution capabilities of de Bruijn graph assemblers. We evaluated HINGE on the long-read bacterial datasets from the NCTC project, where it produces more finished assemblies than either Miniasm or the NCTC manual pipeline, which is based on the HGAP assembler and Circlator. HINGE also allows us to identify 40 datasets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, whereas other approaches such as the NCTC pipeline and FALCON either fragment the assembly or resolve the ambiguity arbitrarily.
