Detection of Outlier Residues for Improving Interface Prediction in Protein Hetero-complexes

Sequence-based understanding and identification of protein binding interfaces is a challenging research topic because of the complexity of protein systems and the imbalanced distribution of interface and noninterface residues. This paper presents an outlier detection approach to address the redundancy problem in protein interaction data. The cleaned training data are then used to improve prediction performance. We use three novel measures to describe the extent to which a residue is considered an outlier in comparison with the other residues: the distance of a residue instance from the center instance of all residue instances with the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating these three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are then taken as input for a support vector machine (SVM) ensemble. The proposed SVM ensemble performs better when trained on input data without outliers than when trained on data with outliers. Our method is also more accurate than many previously published methods on benchmark data sets. In our empirical studies, we found that some outlier interface residues are in fact located near noninterface regions, and some outlier noninterface residues are close to interface regions.
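
As a rough illustration of the Dist factor only, the following minimal sketch computes each instance's Euclidean distance from the mean vector of its own class and drops instances that lie unusually far from it. The function names, the choice of Euclidean distance, and the mean-plus-k-standard-deviations cutoff are assumptions made for illustration; the abstract does not specify how Dist is measured, how PCL and IWB are defined, or how the three factors are combined, so none of that is reproduced here.

```python
import math
from collections import defaultdict


def class_centroids(X, y):
    """Return the mean feature vector of each class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def dist_scores(X, y):
    """Euclidean distance of each instance from its own class centroid.

    Assumption: Euclidean distance stands in for the Dist measure.
    """
    centroids = class_centroids(X, y)
    return [math.dist(row, centroids[label]) for row, label in zip(X, y)]


def remove_outliers(X, y, k=2.0):
    """Drop instances whose score exceeds mean + k * std within their class.

    The cutoff rule and the parameter k are hypothetical; they only
    illustrate removing instances with a sufficiently large score.
    Returns the kept rows, kept labels, and kept original indices.
    """
    scores = dist_scores(X, y)
    by_class = defaultdict(list)
    for score, label in zip(scores, y):
        by_class[label].append(score)

    cutoff = {}
    for label, vals in by_class.items():
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        cutoff[label] = mean + k * std

    keep = [i for i, (s, lab) in enumerate(zip(scores, y)) if s <= cutoff[lab]]
    return [X[i] for i in keep], [y[i] for i in keep], keep
```

In a pipeline of the kind the abstract describes, the rows returned by `remove_outliers` would serve as the cleaned training set passed to the classifier; the per-class cutoff is used here so that the minority class is not penalized by the spread of the majority class.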
