Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds

Background: Spaced seeds, also called gapped q-grams, gapped k-mers, or spaced q-grams, have been shown to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequence analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free methods. Unfortunately, spaced seeds must be designed beforehand. This task is known to be time-consuming because of the number of spaced seed candidates. Moreover, the outcome can be altered by a set of arbitrarily chosen parameters of the probabilistic alignment models used. In this general context, dominant seeds were introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity.

Results: We extend the work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), and demonstrate that the same dominance definition can be applied and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, in which lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models by applying a counting semi-ring to quickly compute the large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of either single seeds to thoroughly analyse or multiple seeds to store. Moreover, at http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance, we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.
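To make the notion of a parameter-free sensitivity on the Bernoulli model concrete, the following minimal Python sketch counts, for each number of matches m, how many binary alignments of a fixed length are hit by a seed; these counts are the polynomial coefficients from which the sensitivity can be evaluated for any match probability p. This is a brute-force illustration only, not the counting semi-ring computation described in the abstract, and the function names (`hits`, `hit_counts`, `sensitivity`) and the alignment length of 16 are assumptions chosen for the example.

```python
from itertools import product


def hits(seed: str, alignment: tuple) -> bool:
    """Return True if the seed hits the alignment at some offset.

    The seed is a string over {'1', '0'}: '1' is a must-match position,
    '0' is a don't-care position. The alignment is a tuple of 0/1 values
    (1 = match, 0 = mismatch).
    """
    span = len(seed)
    must_match = [i for i, c in enumerate(seed) if c == "1"]
    return any(
        all(alignment[offset + i] for i in must_match)
        for offset in range(len(alignment) - span + 1)
    )


def hit_counts(seed: str, n: int) -> list:
    """counts[m] = number of length-n alignments with exactly m matches hit by the seed.

    Brute force over all 2**n alignments, so only usable for small n.
    """
    counts = [0] * (n + 1)
    for alignment in product((0, 1), repeat=n):
        if hits(seed, alignment):
            counts[sum(alignment)] += 1
    return counts


def sensitivity(counts: list, p: float) -> float:
    """Evaluate sum_m counts[m] * p**m * (1 - p)**(n - m) (Bernoulli model)."""
    n = len(counts) - 1
    return sum(c * p**m * (1 - p) ** (n - m) for m, c in enumerate(counts))


if __name__ == "__main__":
    n = 16  # arbitrary small alignment length, assumed for illustration
    spaced = hit_counts("11110110111", n)
    contiguous = hit_counts("111111111", n)  # same weight (9), no don't-care positions
    for p in (0.7, 0.8, 0.9):
        print(p, sensitivity(spaced, p), sensitivity(contiguous, p))
```

Because the coefficient vector is independent of p, it is computed once per seed and then reused to compare seeds over the whole range of p, which is the idea behind a parameter-free comparison.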