Prediction of Implicit Protein-Protein Interaction by Optimal Associative Feature Mining

Proteins perform their biological functions by interacting with other proteins or compounds. Because protein-protein interactions are intrinsic to most cellular processes, predicting them is an important problem in post-genomic biology, where many research groups have produced abundant interaction data. In this paper, we present an associative feature mining method that predicts implicit protein-protein interactions in S. cerevisiae from public protein-protein interaction data. To overcome the dimensionality problem of conventional data mining approaches, we employ a feature dimension reduction filter (FDRF) method, based on information theory, that selects optimally informative features and speeds up the overall mining procedure. To predict interactions, we use an association rule discovery algorithm for associative feature and rule mining. Using the discovered associative features, we predict implicit protein interactions that were not observed in the training data. In our experiments, the proposed method achieves about 94.8% prediction accuracy and runs 32.5% faster than the conventional method without a feature filter.
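The two stages named in the abstract, an information-theoretic feature filter followed by association rule discovery, can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: symmetric uncertainty is used as the filter score (the abstract says only that FDRF is based on information theory), the rule miner is restricted to one-to-one rules rather than a full Apriori-style search, and all function names (`entropy`, `symmetric_uncertainty`, `filter_features`, `mine_rules`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a score in [0, 1].

    Assumption: this measure stands in for the FDRF score.
    """
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    # Mutual information from the joint entropy: I = H(X) + H(Y) - H(X, Y).
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def filter_features(rows, labels, threshold):
    """Return indices of feature columns whose SU with the label meets the threshold."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns)
            if symmetric_uncertainty(col, labels) >= threshold]


def mine_rules(transactions, min_support, min_confidence):
    """Return one-to-one rules (lhs, rhs, support, confidence) above both thresholds.

    Each transaction is a set of feature names. This is a simplified,
    pairwise stand-in for a full association rule discovery algorithm.
    """
    n = len(transactions)
    item_count = Counter(item for t in transactions for item in t)
    pair_count = Counter(pair for t in transactions
                         for pair in combinations(sorted(t), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Test both directions of the rule, since confidence is asymmetric.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules
```

In this sketch, `filter_features` would run first to discard feature columns that carry little information about the interaction label, and `mine_rules` would then operate on the reduced feature sets. Shrinking the item vocabulary before rule mining is what makes the speed-up reported in the abstract plausible, since the number of candidate itemsets grows quickly with the number of features.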
