Support Vector Machine Classification of Probability Models and Peptide Features for Improved Peptide Identification from Shotgun Proteomics

Extracting association rules from large datasets typically yields a very large number of rules. One way to tackle this problem is to filter the resulting rule set, which reduces the number of rules at the cost of also eliminating potentially interesting ones. When exploring a new dataset in search of relevant associations, it may be more useful for miners to have an overview of the space of rules obtainable from the dataset than to receive an arbitrary set of rules that score highly on given interest measures. We describe a rule extraction approach that favors rule diversity, allowing miners to gain an overview of the rule space while reducing semantic redundancy within the rule set. The approach couples itemset-driven rule generation with a cluster-based filtering process. The resulting rule set provides a starting point for user-driven exploration of the rule space.
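
The abstract does not give algorithmic details, so the following is only a minimal sketch of what an itemset-driven generation step followed by a cluster-based filter might look like. Everything in it is an assumption made for illustration: the brute-force itemset enumeration, the Jaccard distance over the items a rule mentions, the greedy "leader" clustering, and all function and parameter names (`frequent_itemsets`, `rules_from_itemsets`, `diverse_rules`, `max_distance`) are hypothetical and not taken from the paper.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Enumerate itemsets with support >= min_support (brute force, small data only)."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                result[candidate] = support
                found = True
        if not found:
            # No frequent itemset of this size, so no larger one can be frequent.
            break
    return result


def rules_from_itemsets(itemsets, min_confidence):
    """Derive rules (antecedent, consequent, support, confidence) from each itemset."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                antecedent = frozenset(ante)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = support / itemsets[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return rules


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two non-empty item sets."""
    return 1 - len(a & b) / len(a | b)


def diverse_rules(rules, max_distance):
    """Greedy leader clustering: keep one representative rule per cluster.

    Rules are visited from highest to lowest confidence; a rule starts a new
    cluster only if its items are farther than max_distance from every
    representative kept so far (an assumed notion of semantic redundancy).
    """
    ordered = sorted(rules, key=lambda r: (-r[3], -r[2], sorted(r[0]), sorted(r[1])))
    representatives = []
    for rule in ordered:
        items = rule[0] | rule[1]
        if all(jaccard_distance(items, rep[0] | rep[1]) > max_distance
               for rep in representatives):
            representatives.append(rule)
    return representatives
```

In this sketch, rules that mention nearly the same items collapse into a single representative, so the retained set spreads across different regions of the rule space instead of concentrating on the highest-scoring rules. A real system would replace the brute-force enumeration with an algorithm such as Apriori and could use any clustering method and distance.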
