Indexing schemes for similarity search in datasets of short protein fragments

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search is important both as a building block for more complex algorithms and for possible direct use in biological investigations, where datasets are of the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well: for example, it outputs the 100 nearest neighbours of any possible fragment of length 10 after scanning, on average, less than 1% of the entire dataset.
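To make the search problem concrete, the following is a minimal sketch of ungapped, score matrix-based similarity between equal-length fragments, together with the brute-force baseline that scans the whole dataset. It is not the indexing scheme proposed here. The amino acid grouping, the scoring rule in `toy_score`, and all function names are assumptions made for illustration; a real application would use an established substitution matrix instead.

```python
import heapq

# Hypothetical grouping of the 20 standard amino acids (illustration only,
# not a published substitution matrix).
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Toy per-residue score: identical > same group > different group."""
    if a == b:
        return 2
    if GROUP_OF[a] == GROUP_OF[b]:
        return 1
    return -1


def ungapped_score(query, fragment):
    """Sum per-position scores; no gaps, so both lengths must match."""
    if len(query) != len(fragment):
        raise ValueError("ungapped comparison needs equal-length fragments")
    return sum(toy_score(a, b) for a, b in zip(query, fragment))


def nearest_fragments(query, dataset, k):
    """Brute-force baseline: score every fragment and keep the k best."""
    return heapq.nlargest(k, dataset, key=lambda f: ungapped_score(query, f))


if __name__ == "__main__":
    fragments = ["AVDK", "ACDE", "WWWW"]
    print(nearest_fragments("ACDE", fragments, 2))
```

The baseline touches every fragment for every query. An index of the kind described above aims to return the same neighbours while scanning only a small fraction of the dataset.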
