Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature

BackgroundThe selection of relevant articles for curation, and linking those articles to experimental techniques confirming the findings became one of the primary subjects of the recent BioCreative III contest. The contest’s Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles.ResultsWe proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task’s development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthew’s Correlation Coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and also competitive according to other metrics.ConclusionsOur novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents were among the main contributors to the performance.

[1]  Jun'ichi Tsujii,et al.  Protein-protein interaction extraction by leveraging multiple kernels and parsers , 2009, Int. J. Medical Informatics.

[2]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[3]  Roberto Basili,et al.  Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms by Thorsten Joachims , 2003, Comput. Linguistics.

[4]  Thorsten Joachims,et al.  A support vector method for multivariate performance measures , 2005, ICML.

[5]  Naoaki Okazaki,et al.  Semantic Search on Digital Document Repositories based on Text Mining Results , 2009 .

[6]  Alexander A. Morgan,et al.  Overview of BioCreAtIvE task 1B: normalized gene lists , 2005, BMC Bioinformatics.

[7]  M. Romacker,et al.  OntoGene in BioCreative II , 2007, Genome Biology.

[8]  Andrew McCallum,et al.  Using Maximum Entropy for Text Classification , 1999 .

[9]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[10]  Sophia Ananiadou,et al.  Disambiguating the species of biomedical named entities using natural language parsers , 2010, Bioinform..

[11]  Chih-Jen Lin,et al.  Trust Region Newton Method for Logistic Regression , 2008, J. Mach. Learn. Res..

[12]  Chih-Jen Lin,et al.  Trust region Newton methods for large-scale logistic regression , 2007, ICML '07.

[13]  K. Cohen,et al.  Overview of BioCreative II gene normalization , 2008, Genome Biology.

[14]  Sophia Ananiadou,et al.  How to make the most of NE dictionaries in statistical NER , 2008, BMC Bioinformatics.

[15]  B. Haddow,et al.  Automating curation using a natural language processing pipeline , 2008, Genome Biology.

[16]  James R. Curran,et al.  Language Independent NER using a Maximum Entropy Tagger , 2003, CoNLL.

[17]  K. Cohen,et al.  Biomedical language processing: what's beyond PubMed? , 2006, Molecular cell.

[18]  A Valencia,et al.  An Overview of BioCreative II.5 , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[19]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[20]  Pradeep Ravikumar,et al.  A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[21]  Beatrice Alex,et al.  Assisted Curation: Does Text Mining Really Help? , 2007, Pacific Symposium on Biocomputing.

[22]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[23]  Jun'ichi Tsujii,et al.  Event Extraction with Complex Event Classification Using Rich Features , 2010, J. Bioinform. Comput. Biol..

[24]  F Rinaldi,et al.  OntoGene in BioCreative II.5 , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[25]  Hans-Michael Müller,et al.  Automatic document classification of biological literature , 2006, BMC Bioinformatics.

[26]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[27]  A. Valencia,et al.  Overview of the protein-protein interaction annotation extraction task of BioCreative II , 2008, Genome Biology.

[28]  Yiming Yang,et al.  A study of thresholding strategies for text categorization , 2001, SIGIR '01.

[29]  Chun-Nan Hsu,et al.  Exploring Match Scores to Boost Precision of Gene Normalization. , 2007 .

[30]  Junichi Tsujii,et al.  Event extraction for systems biology by text mining the literature. , 2010, Trends in biotechnology.