Optimal seed solver: optimizing seed selection in read mapping

MOTIVATION Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the ability of the mapper in selecting less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds. RESULTS We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-base-pair read in [Formula: see text] operations on average and in [Formula: see text] operations in the worst case, while generating a maximum of [Formula: see text] seed frequency database lookups. We compare OSS against four state-of-the-art seed selection schemes and observe that OSS provides a 3-fold reduction in average seed frequency over the best previous seed selection optimizations. AVAILABILITY AND IMPLEMENTATION We provide an implementation of the Optimal Seed Solver in C++ at: https://github.com/CMU-SAFARI/Optimal-Seed-Solver CONTACT hxin@cmu.edu, calkan@cs.bilkent.edu.tr or onur@cmu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  Paul Flicek,et al.  Sense from sequence reads: methods for alignment and assembly , 2009, Nature Methods.

[2]  Eugene W. Myers,et al.  A fast bit-vector algorithm for approximate string matching based on dynamic programming , 1998, JACM.

[3]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[4]  J. Kitzman,et al.  Personalized Copy-Number and Segmental Duplication Maps using Next-Generation Sequencing , 2009, Nature Genetics.

[5]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[6]  Roderic Guigó,et al.  The GEM mapper: fast, accurate and versatile alignment by filtration , 2012, Nature Methods.

[7]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[8]  Giovanni Manzini,et al.  Opportunistic data structures with applications , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[9]  Philip L. F. Johnson,et al.  A Draft Sequence of the Neandertal Genome , 2010, Science.

[10]  Philip L. F. Johnson,et al.  Genetic history of an archaic hominin group from Denisova Cave in Siberia , 2010, Nature.

[11]  Torbjørn Rognes,et al.  Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation , 2011, BMC Bioinformatics.

[12]  Albert J. Vilella,et al.  Insights into hominid evolution from the gorilla genome sequence , 2012, Nature.

[13]  Thomas Meitinger,et al.  Loss-of-function mutations in SLC30A8 protect against type 2 diabetes , 2014, Nature Genetics.

[14]  Christophe Dessimoz,et al.  SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and ×86/SSE2 , 2008, BMC Research Notes.

[15]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[16]  M. Frith,et al.  Adaptive seeds tame genomic sequence comparison. , 2011, Genome research.

[17]  Adrian W. Briggs,et al.  Individual A High-Coverage Genome Sequence from an Archaic Denisovan , 2012 .

[18]  Onur Mutlu,et al.  Accelerating read mapping with FastHASH , 2013, BMC Genomics.

[19]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[20]  Michael Brudno,et al.  SHRiMP: Accurate Mapping of Short Color-space Reads , 2009, PLoS Comput. Biol..

[21]  Adrian W. Briggs,et al.  A High-Coverage Genome Sequence from an Archaic Denisovan Individual , 2012, Science.

[22]  Arcadi Navarro,et al.  A burst of segmental duplications in the genome of the African great ape ancestor , 2009, Nature.

[23]  Eugene W. Myers,et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length , 2005, RECOMB.

[24]  Knut Reinert,et al.  RazerS 3: Faster, fully sensitive read mapping , 2012, Bioinform..

[25]  Rob Pieters,et al.  PHF6 mutations in T-cell acute lymphoblastic leukemia , 2010, Nature Genetics.

[26]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[27]  Nagesh V. Honnalli,et al.  Hobbes: optimized gram-based methods for efficient read alignment , 2011, Nucleic acids research.

[28]  J. Troge,et al.  Tumour evolution inferred by single-cell sequencing , 2011, Nature.

[29]  Lavinia Egidi,et al.  Multiple seeds sensitivity using a single seed with threshold , 2015, J. Bioinform. Comput. Biol..

[30]  Arcadi Navarro,et al.  Great ape genetic diversity and population history , 2013, Nature.

[31]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[32]  Kim R. Rasmussen,et al.  Efficient q-Gram Filters for Finding All-Matches Over a Given Length , 2005 .

[33]  Eugene W. Myers,et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length , 2006, J. Comput. Biol..

[34]  Onur Mutlu,et al.  Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping , 2015, Bioinform..

[35]  Arcadi Navarro,et al.  Gorilla genome structural variation reveals evolutionary parallelisms with chimpanzee. , 2011, Genome research.

[36]  Bin Ma,et al.  PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[37]  L. Christophorou Science , 2018, Emerging Dynamics: Science, Energy, Society and Values.

[38]  Dekel Tsur,et al.  Approximate String Matching using a Bidirectional Index , 2013, Theor. Comput. Sci..

[39]  Emily H Turner,et al.  Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome , 2010, Nature Genetics.