Imbalanced text classification on host pathogen protein-protein interaction documents

Protein-protein interactions (PPIs) are important in understanding the fundamental processes governing cell biology. However, a large number of scientific findings about PPIs are buried in the growing volume of biomedical literature. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. In practice, however, only a small number of positive documents can usually be obtained, either manually or from PPI knowledge bases with literature-based evidence, whereas negative documents are plentiful. In this paper, we investigate the effects of feature selection, feature weighting, and the kernel function of Support Vector Machines (SVMs) on imbalanced two-class classification, using 1360 documents on host-pathogen protein-protein interactions. The results show that a suitable feature weighting approach is an important factor in improving classification performance. Adjusting the cost-sensitive parameter of an SVM with a radial basis function (RBF) kernel can decrease the minority-class misclassification ratio and increase classification accuracy on imbalanced document classification. Based on these experiments, an automated classification system can be developed to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions.
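To make the cost-sensitive idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a common heuristic in which each class is weighted inversely to its frequency, and it uses a hypothetical toy label set (10 positive, 90 negative documents) rather than the paper's data. The function names are illustrative. The sketch also shows why overall accuracy alone can hide a high minority-class misclassification ratio.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def per_class_error(y_true, y_pred):
    """Fraction of documents of each true class that were misclassified."""
    totals, wrong = Counter(y_true), Counter()
    for t, p in zip(y_true, y_pred):
        if t != p:
            wrong[t] += 1
    return {cls: wrong[cls] / totals[cls] for cls in totals}


def accuracy(y_true, y_pred):
    """Fraction of all documents classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels: 1 = PPI-related (minority), 0 = other (majority).
y_true = [1] * 10 + [0] * 90
# A trivial classifier that always predicts the majority class.
y_majority = [0] * 100

weights = balanced_class_weights(y_true)
overall = accuracy(y_true, y_majority)
errors = per_class_error(y_true, y_majority)

print("class weights:", weights)
print("accuracy:", overall)
print("per-class error:", errors)
```

In this toy case the majority-only classifier is 90% accurate yet misclassifies every positive document. Weights of this kind could be supplied as per-class penalty multipliers to an SVM, for example through LIBSVM's `-wi` option or scikit-learn's `class_weight` argument; whether the paper used this particular weighting scheme is not stated in the abstract.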
