Blocked Pattern Matching Problem and Its Applications in Proteomics

Matching a mass spectrum against a text (a key computational task in proteomics) is slow because existing text indexing algorithms (with search time independent of the text size) are not applicable in the domain of mass spectrometry. As a result, many important applications (e.g., searches for mutated peptides) are prohibitively time-consuming, and even the standard search for non-mutated peptides is becoming too slow given recent advances in high-throughput genomics and proteomics technologies. We introduce a new paradigm, the Blocked Pattern Matching (BPM) Problem, that models peptide identification. BPM corresponds to matching a pattern against a text (both over the alphabet of integers) under the assumption that each symbol a in the pattern can match a block of consecutive symbols in the text whose sum is a. BPM opens a new, still unexplored direction in combinatorial pattern matching and leads to the Mutated BPM Problem (modeling identification of mutated peptides) and the Fused BPM Problem (modeling identification of fused peptides in tumor genomes). We illustrate how BPM algorithms solve problems that are beyond the reach of existing proteomics tools.
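
To make the matching rule concrete, the following is a minimal sketch of a naive BPM scan. It is not the indexed algorithm the abstract alludes to. It assumes all symbols in the text and the pattern are positive integers (as integer masses would be), so that for a fixed start position the block boundaries are forced. The function name `blocked_matches` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def blocked_matches(text: List[int], pattern: List[int]) -> List[Tuple[int, int]]:
    """Naive Blocked Pattern Matching over positive integers.

    Returns (start, end) pairs such that text[start:end] can be cut into
    len(pattern) consecutive non-empty blocks whose sums equal the pattern
    symbols in order. Assumes every symbol is a positive integer.
    """
    hits = []
    n = len(text)
    for start in range(n):
        pos = start
        matched = True
        for a in pattern:
            total = 0
            # Grow the current block until its sum reaches or passes a.
            while pos < n and total < a:
                total += text[pos]
                pos += 1
            if total != a:
                # Overshot a, or ran out of text: no match from this start.
                matched = False
                break
        if matched:
            hits.append((start, pos))
    return hits


# Hypothetical integer text and pattern.
text = [57, 71, 87, 57, 99, 71]
print(blocked_matches(text, [128, 144]))  # blocks (57+71) and (87+57)
print(blocked_matches(text, [57]))        # single-symbol blocks
```

This scan takes time proportional to the text length times the matched span for every query, which is the kind of text-size-dependent cost the abstract describes as too slow; it serves only to illustrate the definition.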
