Sparse Markov chain-based semi-supervised multi-instance multi-label method for protein function prediction

Automated assignment of protein function has received considerable attention in recent years in genome-wide studies. With the rapid accumulation of genome sequencing data produced by high-throughput experimental techniques, manually annotating the functional properties of proteins has become increasingly cumbersome, and data sets of this size can only be annotated computationally. However, automatically assigning functions to uncharacterized proteins remains challenging because of the inherent difficulty and complexity of the task. Previous studies have shown that the multi-instance multi-label (MIML) framework is effective for problems involving complicated objects with multiple semantic meanings. In protein function prediction, each protein may be associated with several distinct structural units and with multiple functional properties; each structural unit is described by an instance, and each functional property is treated as a class label. It is therefore natural to tackle protein function prediction within the MIML framework. In this paper, we propose a sparse Markov chain-based semi-supervised MIML method, called Sparse-Markov. A sparse transductive probability graph is constructed to encode the affinity information of the data, based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in this graph to find a sparse steady-state probability distribution of the Markov chain model and use it for protein function prediction, so that two proteins receive similar functional labels if they are close to each other under the ensemble Hausdorff distance. Experimental results on seven real-world organism data sets covering three biological domains show that the proposed Sparse-Markov method achieves better performance than four state-of-the-art MIML learning algorithms.
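
To make the pipeline described in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each protein is a bag of instance feature vectors, that the ensemble distance is a plain average of the maximal, minimal and average Hausdorff distances, that the graph is sparsified by keeping the k nearest neighbours of each bag, and that labels are propagated with a random walk with restart. All function names, the Gaussian-style kernel, and the parameters `k` and `alpha` are hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np

def hausdorff(A, B, kind="max"):
    """Hausdorff-type distance between two bags (rows are instances)."""
    # Pairwise Euclidean distances between every instance of A and of B.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    a_to_b = D.min(axis=1)  # nearest instance in B for each instance of A
    b_to_a = D.min(axis=0)  # nearest instance in A for each instance of B
    if kind == "max":
        return max(a_to_b.max(), b_to_a.max())
    if kind == "min":
        return D.min()
    if kind == "avg":
        return (a_to_b.sum() + b_to_a.sum()) / (len(A) + len(B))
    raise ValueError(f"unknown kind: {kind}")

def ensemble_distance(bags):
    """Assumed ensemble: unweighted mean of the three Hausdorff variants."""
    n = len(bags)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.mean([hausdorff(bags[i], bags[j], kind)
                         for kind in ("max", "min", "avg")])
            D[i, j] = D[j, i] = d
    return D

def sparse_transition_matrix(D, k=2):
    """Column-stochastic matrix keeping only k nearest neighbours per bag."""
    n = D.shape[0]
    W = np.zeros_like(D)
    sigma = D[D > 0].mean()  # assumed kernel bandwidth
    for j in range(n):
        order = np.argsort(D[:, j])
        nbrs = [i for i in order if i != j][:k]
        W[nbrs, j] = np.exp(-D[nbrs, j] / sigma)
    return W / W.sum(axis=0, keepdims=True)

def propagate(P, Y, labeled, alpha=0.8, iters=200, tol=1e-9):
    """Per-label random walk with restart on the labeled bags (assumed scheme)."""
    n, c = Y.shape
    scores = np.zeros((n, c))
    for l in range(c):
        d = np.where(labeled, Y[:, l], 0).astype(float)
        if d.sum() == 0:
            continue  # no labeled example for this class
        d /= d.sum()
        x = np.full(n, 1.0 / n)
        for _ in range(iters):
            x_new = (1 - alpha) * (P @ x) + alpha * d
            converged = np.abs(x_new - x).sum() < tol
            x = x_new
            if converged:
                break
        scores[:, l] = x
    return scores

# Toy data: two well-separated groups of bags, one labeled bag per group.
bags = [np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([[0.0, 1.0]]),
        np.array([[1.0, 1.0], [0.5, 0.5]]), np.array([[10.0, 10.0]]),
        np.array([[11.0, 10.0], [10.0, 11.0]]), np.array([[10.5, 10.5]])]
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0]])
labeled = np.array([True, False, False, True, False, False])

P = sparse_transition_matrix(ensemble_distance(bags), k=2)
scores = propagate(P, Y, labeled)
```

In this toy setting, each unlabeled bag receives a higher score for the label of the labeled bag in its own group, which is the behaviour the abstract describes: proteins that are close under the ensemble distance obtain similar functional labels. The sparsity-inducing steady-state computation of the actual method is not reproduced here.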
