Improved classification model for peptide identification based on self-paced learning

Post-database searching is a key procedure for validating peptide spectrum matches (PSMs) in mass spectrometry-based protein identification. Although many machine learning-based approaches have been developed to improve the accuracy of peptide identification, there is still room for improvement because of the poor quality of data samples. CRanker has proven effective and efficient, identifying more PSMs than benchmark algorithms. However, it has two weaknesses: overfitting and instability on small datasets. In this paper, we incorporate two new strategies into CRanker to address these weaknesses. First, we modify the CRanker model to use different weight parameters for the learning losses of decoy and target PSMs. Second, we employ self-paced learning in the training process to help the classifier avoid incorrect PSMs. Experimental studies show that the modified CRanker is more stable than the original and outperforms benchmark methods in terms of the number of identified PSMs at the same false discovery rates (FDRs).
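The two strategies named above can be illustrated with a minimal sketch. This is not the CRanker implementation: it assumes a linear classifier, a hinge loss, and the hard-threshold form of self-paced learning, in which a sample is used for training only while its weighted loss stays below an age parameter that grows each round. The names `c_target`, `c_decoy`, `lam`, and `growth`, and the synthetic data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def hinge_losses(w, b, X, y):
    """Per-sample hinge loss max(0, 1 - y * (X w + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))


def fit_weighted_hinge(X, y, sample_w, w, b, lr=0.1, reg=1e-2, n_steps=200):
    """Sub-gradient descent on an L2-regularized, sample-weighted hinge loss."""
    n = len(y)
    for _ in range(n_steps):
        active = y * (X @ w + b) < 1.0       # samples inside the margin
        coef = sample_w * active * y         # weighted sub-gradient factors
        w = w - lr * (reg * w - (coef @ X) / n)
        b = b + lr * coef.sum() / n
    return w, b


def self_paced_train(X, y, c_target=1.0, c_decoy=2.0,
                     lam=0.5, growth=1.5, n_rounds=5):
    """Alternate between fitting the classifier and reselecting easy samples.

    y is +1 for target PSMs and -1 for decoy PSMs (an assumption of this
    sketch). c_target and c_decoy are the separate loss weights.
    """
    class_w = np.where(y > 0, c_target, c_decoy)
    w, b = np.zeros(X.shape[1]), 0.0
    v = np.ones(len(y))                      # start with every sample included
    for _ in range(n_rounds):
        w, b = fit_weighted_hinge(X, y, class_w * v, w, b)
        losses = class_w * hinge_losses(w, b, X, y)
        v = (losses < lam).astype(float)     # keep only currently easy samples
        lam *= growth                        # admit harder samples next round
    return w, b, v


# Synthetic stand-in for PSM feature vectors (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b, v = self_paced_train(X, y)
print("samples selected in final round:", int(v.sum()), "of", len(y))
```

Samples with a large loss, which in this setting would correspond to likely incorrect PSMs, receive `v = 0` and do not influence the next fit until the threshold `lam` has grown enough to admit them.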
