Classifying protein-protein interaction articles from biomedical literature using many relevant features and context-free grammar

Abstract Detecting articles that describe protein–protein interactions (PPI) is a significant step in biological information extraction. In this paper, we present a hybrid text classification (TC) method to identify PPI articles. Our methodology comprises four modules: i) feature extraction, ii) semantic similarity-based feature selection, iii) ensemble learning, and iv) context-free grammar (CFG) based post-processing. In the first module, we extract linguistic and domain-specific features, such as protein names and interaction cues, to classify the documents. The second module uses similarity-based feature selection to retain the most relevant features. In the third module, we employ AdaBoost-based ensemble learning to improve the performance of weak classifiers. The final module applies CFG-based pattern matching to correct classifier errors. Our hybrid TC method was trained and tested on the BioCreative III corpus, on which it attained a precision of 0.5813 and a recall of 0.6582. The overall F-score of the system was 0.6228, and our hybrid approach, combining the ensemble classifier with CFG post-processing, outperforms most state-of-the-art systems.
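
To make the idea of CFG-based post-processing concrete, the following is a minimal sketch and not the grammar or decision rule used in the paper. The lexicon (`PROTEINS`, `CUES`), the productions in `GRAMMAR`, the function names, and the confidence threshold are all hypothetical; in particular, the rule "fall back to the grammar only when the classifier is unsure" is an assumption made for illustration, since the abstract does not state how the CFG output is combined with the classifier output.

```python
# Hypothetical toy lexicon; a real system would rely on a protein
# named-entity tagger and a curated list of interaction cues.
PROTEINS = {"rad51", "brca2", "p53", "mdm2"}
CUES = {"binds", "interacts", "phosphorylates", "interaction", "binding"}

# Toy context-free grammar (assumed, not taken from the paper).
# Upper-case keys are nonterminals; everything else is a terminal.
GRAMMAR = {
    "S": [
        ["NP", "CUE", "NP"],                # "BRCA2 binds RAD51"
        ["NP", "CUE", "with", "NP"],        # "BRCA2 interacts with RAD51"
        ["CUE", "of", "NP", "with", "NP"],  # "interaction of p53 with MDM2"
    ],
    "NP": [["PROT"], ["PROT", "and", "NP"]],  # one protein or a conjunction
}


def tag(token):
    """Map a raw token to a terminal symbol: PROT, CUE, or the lower-cased word."""
    word = token.lower().strip(".,;()")
    if word in PROTEINS:
        return "PROT"
    if word in CUES:
        return "CUE"
    return word


def derives(symbols, tokens):
    """Return True if the symbol sequence derives exactly the token sequence.

    Top-down recognition; it terminates because the toy grammar has no
    left recursion.
    """
    if not symbols:
        return not tokens
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        return any(derives(tuple(prod) + rest, tokens) for prod in GRAMMAR[head])
    return bool(tokens) and tokens[0] == head and derives(rest, tokens[1:])


def has_ppi_pattern(sentence):
    """Return True if any contiguous token span of the sentence derives from S."""
    tags = tuple(tag(t) for t in sentence.split())
    return any(
        derives(("S",), tags[i:j])
        for i in range(len(tags))
        for j in range(i + 1, len(tags) + 1)
    )


def post_process(predicted, confidence, text, threshold=0.6):
    """Keep confident predictions; otherwise defer to the grammar (assumed rule)."""
    if confidence >= threshold:
        return predicted
    return any(has_ppi_pattern(s) for s in text.split("."))
```

For example, `post_process(False, 0.55, "BRCA2 interacts with RAD51.")` overrides a low-confidence negative prediction because the sentence matches the second `S` production, whereas the same call with a confidence of 0.9 leaves the classifier's decision unchanged.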
