Fast and accurate correction of optical mapping data via spaced seeds

Abstract

Motivation: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding of assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes their assembly and alignment challenging. Although another method for error correcting Rmap data exists, named cOMet, it is unable to scale to even moderately large genomes. The main challenge in error correction is determining pairs of Rmaps that originate from the same region of the same genome.

Results: We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaptation and application of spaced seeds in the context of optical mapping, which allows spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods while running in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet.

Availability and implementation: Elmeri is publicly available under the GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri.

Supplementary information: Supplementary data are available at Bioinformatics online.
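To illustrate why spaced seeds can tolerate spurious and deleted cut sites, the following is a minimal Python sketch of the general idea. It is not Elmeri's implementation. The representation of an Rmap as a sorted list of cut positions in kbp, the 1 kbp mask bins, the rounding of fragment lengths, and the function names `seed_key` and `candidate_pairs` are all assumptions made for this example. A window is laid over the Rmap, cut sites that fall into "don't care" zones of the mask are ignored, and the remaining fragment lengths form a key. Rmaps that share a key become candidate overlapping pairs.

```python
from collections import defaultdict
from itertools import combinations


def seed_key(cuts, start, mask, bin_kbp=1.0, quant_kbp=1.0):
    """Build a spaced-seed key for the window beginning at `start`.

    cuts: sorted cut positions in kbp (hypothetical Rmap representation).
    mask: string of '1' (care) and '0' (don't care); each character
          covers `bin_kbp` kbp of the window.
    """
    window_kbp = len(mask) * bin_kbp
    kept = [start]
    for cut in cuts:
        offset = cut - start
        # Keep only cut sites inside the window that land in a care zone.
        if 0 < offset < window_kbp and mask[int(offset // bin_kbp)] == "1":
            kept.append(cut)
    # Key: fragment lengths between kept cut sites, rounded to quant_kbp units.
    return tuple(int((b - a) / quant_kbp + 0.5) for a, b in zip(kept, kept[1:]))


def candidate_pairs(rmaps, mask, bin_kbp=1.0):
    """Return pairs of Rmap names that share at least one seed key."""
    window_kbp = len(mask) * bin_kbp
    index = defaultdict(set)
    for name, cuts in rmaps.items():
        for start in cuts:
            # Only use windows that fit entirely inside the Rmap.
            if start + window_kbp <= cuts[-1]:
                key = seed_key(cuts, start, mask, bin_kbp)
                if key:
                    index[key].add(name)
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Toy data: rmap_b is rmap_a with one spurious cut site at 8.5 kbp.
rmap_a = [0.0, 5.0, 12.0, 20.0, 26.0]
rmap_b = [0.0, 5.0, 8.5, 12.0, 20.0, 26.0]

spaced = "1" * 7 + "0" * 3 + "1" * 14   # don't-care zone covers 7-10 kbp
contiguous = "1" * 24                   # no don't-care zone

spaced_pairs = candidate_pairs({"A": rmap_a, "B": rmap_b}, spaced)
contiguous_pairs = candidate_pairs({"A": rmap_a, "B": rmap_b}, contiguous)
```

In this toy case the spurious cut site falls into the don't-care zone, so both Rmaps produce the same key under the spaced mask and are reported as a candidate pair, whereas the contiguous mask yields different keys and misses the pair. A single mask only protects against errors that happen to fall in its don't-care zones, and rounding can still split fragments whose lengths lie near a rounding boundary, so a practical method would need to handle both effects.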
