Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

BackgroundDetermination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data.ResultsIn this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs) (which do not take advantage of unlabeled data); (ii) an expectation maximization (EM); and (iii) a co-training based approaches to semi-supervised training of MMs (that make use of unlabeled data).ConclusionsThe results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance, and in some cases outperform, the co-training based semi-supervised MMs.

[1]  Nicolas Le Roux,et al.  Label Propagation and Quadratic Criterion , 2006, Semi-Supervised Learning.

[2]  Xiaojin Zhu,et al.  Kernel conditional random fields: representation and clique selection , 2004, ICML.

[3]  Xiaojin Zhu,et al.  Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization , 2006 .

[4]  Mikhail Belkin,et al.  Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples , 2006, J. Mach. Learn. Res..

[5]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[6]  Gustavo Camps-Valls,et al.  Semi-Supervised Graph-Based Hyperspectral Image Classification , 2007, IEEE Transactions on Geoscience and Remote Sensing.

[7]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[8]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[9]  Yanjun Qi,et al.  Semi-Supervised Sequence Labeling with Self-Learned Features , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[10]  Zheng Yuan Prediction of protein subcellular locations using Markov chain models , 1999, FEBS letters.

[11]  William Stafford Noble,et al.  Semi-supervised learning for peptide identification from shotgun proteomics datasets , 2007, Nature Methods.

[12]  Gregory R. Grant,et al.  Bioinformatics - The Machine Learning Approach , 2000, Comput. Chem..

[13]  Jason Weston,et al.  Semi-supervised Protein Classification Using Cluster Kernels , 2003, NIPS.

[14]  J. Davies,et al.  Molecular Biology of the Cell , 1983, Bristol Medico-Chirurgical Journal.

[15]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[16]  W. Ansorge Next-generation DNA sequencing techniques. , 2009, New biotechnology.

[17]  Minoru Kanehisa,et al.  Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs , 2003, Bioinform..

[18]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[19]  Cheng Soon Ong,et al.  An Automated Combination of Kernels for Predicting Protein Subcellular Localization , 2007, WABI.

[20]  BMC Bioinformatics , 2005 .

[21]  Shih-Fu Chang,et al.  Graph construction and b-matching for semi-supervised learning , 2009, ICML '09.

[22]  G. Casella,et al.  Statistical Inference , 2003, Encyclopedia of Social Network Analysis and Mining.

[23]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[24]  Michael T. Hallett,et al.  Refining Protein Subcellular Localization , 2005, PLoS Comput. Biol..

[25]  Vladimir Pavlovic,et al.  Efficient use of unlabeled data for protein sequence classification: a comparative study , 2009, BMC Bioinformatics.

[26]  Vasant Honavar,et al.  Learning accurate and concise naïve Bayes classifiers from attribute value taxonomies and data , 2006, Knowledge and Information Systems.

[27]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[28]  Ke Wang,et al.  PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria , 2003, Nucleic Acids Res..

[29]  Neil D. Lawrence,et al.  Semi-supervised Learning via Gaussian Processes , 2004, NIPS.

[30]  Vasant Honavar,et al.  Abstraction Augmented Markov Models , 2010, 2010 IEEE International Conference on Data Mining.

[31]  Tommi S. Jaakkola,et al.  Information Regularization with Partially Labeled Data , 2002, NIPS.

[32]  Dana Ron,et al.  The power of amnesia: Learning probabilistic automata with variable memory length , 1996, Machine Learning.

[33]  Oliver Kohlbacher,et al.  MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition , 2006, Bioinform..

[34]  Xiaojin Zhu,et al.  Introduction to Semi-Supervised Learning , 2009, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[35]  Dong-Hong Ji,et al.  Word Sense Disambiguation Using Label Propagation Based Semi-Supervised Learning , 2005, ACL.

[36]  Yoshua Bengio,et al.  Semi-supervised Learning by Entropy Minimization , 2004, CAP.

[37]  Tommi S. Jaakkola,et al.  Maximum Entropy Discrimination , 1999, NIPS.

[38]  Zhi-Hua Zhou,et al.  Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples , 2007, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[39]  Jianhua Lin,et al.  Divergence measures based on the Shannon entropy , 1991, IEEE Trans. Inf. Theory.

[40]  Qiang Yang,et al.  Semi-supervised protein subcellular localization , 2009, BMC Bioinformatics.

[41]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.