Towards classifying species in systems biology papers using text mining

BackgroundIn recent years high throughput methods have led to a massive expansion in the free text literature on molecular biology. Automated text mining has developed as an application technology for formalizing this wealth of published results into structured database entries. However, database curation as a task is still largely done by hand, and although there have been many studies on automated approaches, problems remain in how to classify documents into top-level categories based on the type of organism being investigated. Here we present a comparative analysis of state of the art supervised models that are used to classify both abstracts and full text articles for three model organisms.ResultsAblation experiments were conducted on a large gold standard corpus of 10,000 abstracts and full papers containing data on three model organisms (fly, mouse and yeast). Among the eight learner models tested, the best model achieved an F-score of 97.1% for fly, 88.6% for mouse and 85.5% for yeast using a variety of features that included gene name, organism frequency, MeSH headings and term-species associations. We noted that term-species associations were particularly effective in improving classification performance. The benefit of using full text articles over abstracts was consistently observed across all three organisms.ConclusionsBy comparing various learner algorithms and features we presented an optimized system that automatically detects the major focus organism in full text articles for fly, mouse and yeast. We believe the method will be extensible to other organism types.

[1]  Stephan Bloehdorn,et al.  Boosting for Text Classification with Semantic Features , 2004, WebKDD.

[2]  Lorraine K. Tanabe,et al.  Tagging gene and protein names in biomedical text , 2002, Bioinform..

[3]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[4]  William R. Hersh,et al.  A comparative analysis of retrieval features used in the TREC 2006 Genomics Track passage retrieval task , 2007, AMIA.

[5]  Joel D. Martin,et al.  PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine , 2003, BMC Bioinformatics.

[6]  Judith A. Blake,et al.  MGD: the Mouse Genome Database , 2003, Nucleic Acids Res..

[7]  Alexander A. Morgan,et al.  Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup , 2003, ISMB.

[8]  C. Ball,et al.  Saccharomyces Genome Database. , 2002, Methods in enzymology.

[9]  F Rinaldi,et al.  OntoGene in BioCreative II.5 , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10]  The FlyBase database of the Drosophila genome projects and community literature. , 2003, Nucleic acids research.

[11]  Xinglong Wang,et al.  Distinguishing the species of biomedical named entities for term identification , 2008, BMC Bioinformatics.

[12]  Hongfang Liu,et al.  A Study of Text Categorization for Model Organism Databases , 2004, HLT-NAACL 2004.

[13]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[14]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[15]  Eibe Frank,et al.  Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms , 2004, PAKDD.

[16]  Jimmy J. Lin Is searching full text more effective than searching abstracts? , 2009, BMC Bioinformatics.

[17]  Alfonso Valencia,et al.  Overview of BioCreAtIvE: critical assessment of information extraction for biology , 2005, BMC Bioinformatics.

[18]  Lorraine K. Tanabe,et al.  Tagging gene and protein names in full text articles , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[19]  S. Cessie,et al.  Ridge Estimators in Logistic Regression , 1992 .

[20]  Marti A. Hearst,et al.  TREC 2007 Genomics Track Overview , 2007, TREC.

[21]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[22]  M. Romacker,et al.  OntoGene in BioCreative II , 2007, Genome Biology.

[23]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[24]  Ron Kohavi,et al.  The Power of Decision Tables , 1995, ECML.

[25]  Naoaki Okazaki,et al.  Identifying Sections in Scientific Abstracts using Conditional Random Fields , 2008, IJCNLP.

[26]  Haijia Shi Best-first Decision Tree Learning , 2007 .