Comparison of classification methods on protein-protein interaction document classification

Protein-protein interaction (PPI) networks are essential to understanding the fundamental processes governing cell biology. Mining and curating experimental PPI knowledge is critical for the analysis of high-throughput genomics and proteomics data. Several PPI knowledge bases have been built through expensive manual curation, but they are far from comprehensive. Document classification systems have shown the potential to accelerate curation by retrieving PPI-related documents. In practice, however, only a small number of positive documents can be obtained, either manually or from PPI knowledge bases that record literature-based evidence, while a large number of unlabeled documents are available, most of which are negative. Such data sets are called imbalanced. Learning from imbalanced data sets, where one (majority) class has far more examples than the others, presents an important challenge to the machine learning community. It is not clear which kind of classification algorithm is suitable for PPI document classification. In this paper, we compare the performance of several document classifiers on two PPI document sets, varying the number of positives and the ratio of positives to negatives (or unlabeled documents) in the experiments.
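As a rough illustration of the experimental design described above, the following Python sketch builds training subsets with a chosen number of positives and a chosen number of unlabeled documents per positive. It then scores a trivial baseline that labels every document negative. The helper names (`make_subset`, `accuracy_and_f1`), the toy document pools, the grid of sizes and ratios, and the choice of accuracy and F1 as measures are all assumptions made for illustration; they are not taken from the paper.

```python
import random


def make_subset(positives, unlabeled, n_pos, neg_per_pos, seed=0):
    """Sample n_pos positives and n_pos * neg_per_pos unlabeled documents.

    Unlabeled documents are treated as negatives (label 0), which is an
    assumption made here for illustration.
    """
    rng = random.Random(seed)
    n_neg = n_pos * neg_per_pos
    if n_pos > len(positives) or n_neg > len(unlabeled):
        raise ValueError("not enough documents for the requested size/ratio")
    pos = rng.sample(positives, n_pos)
    neg = rng.sample(unlabeled, n_neg)
    return [(doc, 1) for doc in pos] + [(doc, 0) for doc in neg]


def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 of the positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical document pools standing in for real PPI abstracts.
positives = [f"pos_{i}" for i in range(100)]
unlabeled = [f"unl_{i}" for i in range(5000)]

results = {}
for n_pos in (10, 50):                 # vary the number of positives
    for neg_per_pos in (1, 10, 50):    # vary the positive:negative ratio
        subset = make_subset(positives, unlabeled, n_pos, neg_per_pos)
        y_true = [label for _, label in subset]
        y_pred = [0] * len(y_true)     # baseline: always predict negative
        results[(n_pos, neg_per_pos)] = accuracy_and_f1(y_true, y_pred)

for key in sorted(results):
    acc, f1 = results[key]
    print(f"n_pos={key[0]:3d} ratio=1:{key[1]:<3d} accuracy={acc:.3f} F1={f1:.3f}")
```

With this baseline, accuracy rises as the negative share grows while F1 on the positive class stays at zero. This is one reason a comparison on imbalanced data needs to look beyond accuracy; in a real experiment an actual classifier would replace the always-negative predictions.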
