Classification and knowledge discovery in protein databases

We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. To design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class imbalance, and a method for combining biologically related tasks through clustering based on prior knowledge. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models, owing to their robustness to noise and to low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions differ significantly among the non-overlapping clusters. In our experiments, training classifiers specialized to the class distribution of each cluster resulted in a further decrease in classification error.
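The first two stages can be illustrated with a minimal sketch. It is not the authors' implementation: the test statistic (absolute difference of class means), the significance threshold `alpha`, the number of permutations `n_perm`, and the helper names `permutation_p_value`, `select_features`, and `balanced_bootstrap` are all assumptions made for illustration. The sampler draws one class-balanced bootstrap sample, which under-samples the majority class and resamples the minority class with replacement; an ensemble would train one base classifier per such sample.

```python
import random
from statistics import mean


def permutation_p_value(pos, neg, n_perm=2000, seed=0):
    """Two-sided permutation test for one feature.

    Statistic (an assumption of this sketch): absolute difference
    between the positive-class and negative-class means.
    """
    rng = random.Random(seed)
    observed = abs(mean(pos) - mean(neg))
    pooled = list(pos) + list(neg)
    n_pos = len(pos)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values simulates random label assignment.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_pos]) - mean(pooled[n_pos:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def select_features(X, y, alpha=0.05, **kwargs):
    """Filter step: keep feature indices with p-value below alpha."""
    keep = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if permutation_p_value(pos, neg, **kwargs) < alpha:
            keep.append(j)
    return keep


def balanced_bootstrap(y, rng):
    """Indices of one class-balanced bootstrap sample.

    Each class contributes as many draws (with replacement) as the
    smaller class has examples.
    """
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    k = min(len(pos), len(neg))
    return ([rng.choice(pos) for _ in range(k)]
            + [rng.choice(neg) for _ in range(k)])


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 does not.
    X = [[5.0 + 0.01 * i, i % 2] for i in range(10)]
    X += [[0.01 * i, i % 2] for i in range(30)]
    y = [1] * 10 + [0] * 30
    print("selected features:", select_features(X, y, n_perm=200))
    print("one balanced sample:", balanced_bootstrap(y, random.Random(1)))
```

A filter of this kind scores each feature independently of any classifier, so it stays cheap when the number of features is large; the balanced sampler leaves the choice of base model (logistic regression, decision tree, or neural network) open.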
