Using Dictionaries for Biomedical Text Classification

This paper studies the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three dictionaries (BioCreative [13], NLPBA [8], and a subset of the UniProt database [4], named Protein) and three types of classifiers (KNN, SVM, and Naive Bayes) applied to searches on the PubMed database. The dictionaries are used during the preprocessing and annotation of documents. The best results were obtained with the NLPBA and Protein dictionaries combined with the SVM classifier.
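As a rough illustration of what dictionary-based annotation before classification can look like, the sketch below replaces dictionary terms with an entity tag and then classifies a document with a small nearest-neighbour vote over bag-of-words vectors. It is not the pipeline used in the paper: the dictionary entries, the `PROTEIN` tag, the function names, and the cosine-similarity KNN are all assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical toy dictionary (surface term -> entity tag); a real one
# would be built from a resource such as NLPBA or UniProt.
PROTEIN_TERMS = {"p53": "PROTEIN", "insulin": "PROTEIN", "brca1": "PROTEIN"}


def annotate(text, dictionary):
    """Lowercase and tokenize the text, replacing dictionary terms by their tag."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [dictionary.get(tok, tok) for tok in tokens]


def to_features(text, dictionary):
    """Bag-of-words counts computed over the annotated tokens."""
    return Counter(annotate(text, dictionary))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_predict(query, training, dictionary, k=1):
    """Majority vote among the k training documents most similar to the query."""
    q = to_features(query, dictionary)
    ranked = sorted(
        training,
        key=lambda example: cosine(q, to_features(example[0], dictionary)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up training documents, for illustration only.
training = [
    ("p53 regulates apoptosis in tumor cells", "relevant"),
    ("the hospital opened a new parking lot", "irrelevant"),
]

prediction = knn_predict("BRCA1 mutation found in tumor samples", training, PROTEIN_TERMS)
```

Because both `p53` and `BRCA1` are mapped to the same tag, the query shares a feature with the first training document even though the two texts mention different proteins. This is one plausible way a dictionary could help a classifier generalize across entity names.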

[1] Alfonso Valencia, et al. Overview of BioCreAtIvE: critical assessment of information extraction for biology, 2005, BMC Bioinformatics.

[2] Cathy H. Wu, et al. UniProt: the Universal Protein knowledgebase, 2004, Nucleic Acids Res.

[3] Ted Briscoe, et al. The Derivation of a Grammatically Indexed Lexicon from the Longman Dictionary of Contemporary English, 1987, ACL.

[4] Hamish Cunningham, et al. GATE - a General Architecture for Text Engineering, 1996, COLING.

[5] Burr Settles. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text, 2005.

[6] Ashish Sureka, et al. Semantic Based Text Classification of Patent Documents to a User-Defined Taxonomy, 2009, ADMA.

[7] Hamish Cunningham. GATE, a General Architecture for Text Engineering, 2002.

[8] David W. Aha, et al. Instance-Based Learning Algorithms, 1991, Machine Learning.

[9] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[10] Luis Mateus Rocha, et al. Biomedical Article Classification Using an Agent-Based Model of T-Cell Cross-Regulation, 2010, ICARIS.

[11] Marcel Worring, et al. NIST Special Publication, 2005.

[12] Ying Liu, et al. Using WordNet to Disambiguate Word Senses for Text Classification, 2007, International Conference on Computational Science.

[13] Sungzoon Cho, et al. EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems, 2006, ICONIP.

[14] Stephen R. Garner, et al. WEKA: The Waikato Environment for Knowledge Analysis, 1996.

[15] Nigel Collier, et al. Synonym set extraction from the biomedical literature by lexical pattern discovery, 2007, BMC Bioinformatics.

[16] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[17] Pat Langley, et al. Estimating Continuous Distributions in Bayesian Classifiers, 1995, UAI.

[18] Ricardo Baeza-Yates, et al. Information Retrieval: Data Structures and Algorithms, 1992.

[19] Xiaoyue Wang, et al. Extract Semantic Information from WordNet to Improve Text Classification Performance, 2010, AST/UCMA/ISA/ACN.

[20] George A. Miller, et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[21] Mark Dredze, et al. TREC 2005 Genomics Track Experiments at IBM Watson, 2005, TREC.