Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty

Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two systems are described and contrasted. The first, CISMeF, assigns metaterms (MTs) using rules that build on human indexing of the documents with the Medical Subject Headings® (MeSH®) controlled vocabulary. The second, Journal Descriptor Indexing (JDI), relies on human categorization of about 4,000 journals and on statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of human-assigned categories for 100 MEDLINE documents, using six measures selected from trec_eval. The results show that performance is comparable on five of the measures and that JDI is superior on the sixth. We conclude that these results favor JDI, given the significantly greater intellectual overhead that CISMeF requires: human indexing of each document plus maintenance of a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords. It may be worthwhile to investigate whether this statistical JDI method and the rule-based CISMeF could be combined, and to evaluate whether the two are complementary.
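To make the evaluation setup concrete, the following is a minimal sketch of how a ranked list of categories for one document might be scored against a human-assigned gold standard. The abstract does not name the six trec_eval measures, so the two shown here (precision at k and average precision) are assumptions chosen only because they are standard ranked-retrieval measures. The function names and the example category labels are hypothetical and do not come from either system.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the top-k ranked categories that are in the gold set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for category in ranked[:k] if category in gold)
    return hits / k


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Mean of the precision values at each rank where a gold category occurs,
    normalized by the total number of gold categories."""
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for rank, category in enumerate(ranked, start=1):
        if category in gold:
            hits += 1
            total += hits / rank
    return total / len(gold)


# Hypothetical example: one document, one system's ranked output.
ranked_output = ["Cardiology", "Pediatrics", "Neurology", "Radiology"]
gold_standard = {"Cardiology", "Neurology"}

p_at_3 = precision_at_k(ranked_output, gold_standard, 3)
ap = average_precision(ranked_output, gold_standard)
```

Averaging such per-document scores over all documents in the test set gives one figure per measure for each system, which is the kind of comparison the abstract reports.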
