Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique

Background: In supervised learning, traditional approaches to building a classifier use two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required, which may be infeasible in certain cases, especially those dealing with biological data. Such is the case for membrane-binding peripheral domains, which play important roles in many biological processes, including cell signaling and membrane trafficking, by reversibly binding to membranes. For these domains, a well-defined positive set of domains known to bind membranes is available, along with a large unlabeled set of domains whose membrane-binding affinities have not been measured. This limitation can be addressed by a special class of semi-supervised machine learning called positive-unlabeled (PU) learning, which uses a positive set together with a large unlabeled set.

Methods: In this study, we implement the first application of PU-learning to a protein function prediction problem: identification of peripheral domains. PU-learning starts by iteratively identifying reliable negative (RN) examples from the unlabeled set until convergence, and then builds a classifier using the positive set and the final RN set. A data set of 232 positive cases and ~3750 unlabeled ones was used to construct and validate the protocol.

Results: Holdout evaluation of the protocol on a left-out positive set showed that prediction accuracy reached up to 95% in two independent implementations.

Conclusion: These results suggest that our protocol can be used for predicting membrane-binding properties of a wide variety of modular domains. Protocols like the one presented here are particularly useful when information is available for only one class.
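
The iterative procedure described in the Methods (identify reliable negatives until convergence, then train on the positives and the final RN set) can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier, the two-dimensional toy feature vectors, and the names `pu_learn`, `predict_positive`, `centroid`, and `sq_dist` are all assumptions chosen to keep the example self-contained.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(points: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_positive(x: Vector, pos_c: Vector, neg_c: Vector) -> bool:
    """Stand-in classifier (an assumption): positive if closer to the positive centroid."""
    return sq_dist(x, pos_c) < sq_dist(x, neg_c)


def pu_learn(
    positives: Sequence[Vector],
    unlabeled: Sequence[Vector],
    max_iter: int = 100,
) -> Tuple[List[float], List[float], List[Vector]]:
    """Iteratively shrink the unlabeled set to a reliable negative (RN) set.

    Start by treating every unlabeled example as negative, train on
    positives vs. the current RN set, drop RN members the classifier
    calls positive, and repeat until the RN set stops changing.
    """
    pos_c = centroid(positives)
    reliable_neg = list(unlabeled)
    for _ in range(max_iter):
        neg_c = centroid(reliable_neg)
        kept = [u for u in reliable_neg if not predict_positive(u, pos_c, neg_c)]
        # Stop on convergence, or if filtering would empty the RN set.
        if not kept or len(kept) == len(reliable_neg):
            break
        reliable_neg = kept
    # Final classifier: positive centroid vs. centroid of the final RN set.
    return pos_c, centroid(reliable_neg), reliable_neg


if __name__ == "__main__":
    # Hypothetical toy data: two hidden positives sit among the unlabeled examples.
    positives = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2)]
    unlabeled = [(0.9, 1.1), (1.1, 0.9), (-1.0, -1.0),
                 (-1.2, -0.8), (-0.8, -1.2), (-1.1, -1.1)]
    pos_c, neg_c, rn = pu_learn(positives, unlabeled)
    print("reliable negatives:", rn)
    print("query (0.9, 1.1) predicted positive:", predict_positive((0.9, 1.1), pos_c, neg_c))
```

In a real setting the centroid rule would be replaced by whichever classifier the protocol uses, and the feature vectors would be derived from the domain sequences; the loop structure is the part this sketch is meant to show.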
