An experimental study in automatically categorizing medical documents

In this article, we evaluate the retrieval performance of an algorithm that automatically categorizes medical documents. The categorization, which consists of assigning an International Classification of Diseases (ICD) code to the medical document under examination, is based on well-known information retrieval techniques. The algorithm, which we proposed, operates fully automatically and requires no supervision or training data. Using a database of 20,569 documents, we verify that the algorithm attains average precision in the 70–80% range for category coding and in the 60–70% range for subcategory coding. We also carefully analyze the documents whose categorization disagrees with the one provided by the human specialists. The vast majority of them are cases that can only be fully categorized with the assistance of a human (because, for instance, they require specific knowledge of a given pathology). For a small fraction of all documents (0.77% for category coding and 1.4% for subcategory coding), the algorithm makes assignments that are clearly incorrect. However, this fraction corresponds to only one-fourth of the mistakes made by the human specialists.
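The abstract does not say which information retrieval techniques the algorithm uses, so the following is only a minimal sketch of one training-free approach: treat each category's textual description as a "document", weight terms with tf-idf, and rank categories by cosine similarity to the medical document. The category labels, descriptions, and function names below are hypothetical and are not taken from the paper or from the ICD itself.

```python
import math
import re
from collections import Counter

# Hypothetical category descriptions (illustrative only, not real ICD entries).
CATEGORIES = {
    "infectious": "intestinal infectious diseases cholera typhoid",
    "cardiac": "diseases of the heart myocardial infarction angina",
    "injury": "fracture of bones injury trauma",
}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def idf_table(token_lists):
    """Inverse document frequency computed over the category descriptions."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {term: math.log(1 + n / count) for term, count in df.items()}

def tfidf(tokens, idf):
    """Sparse tf-idf vector; terms unseen in any description are dropped."""
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_categories(document, categories):
    """Return (label, score) pairs sorted from best to worst match."""
    cat_tokens = {label: tokenize(desc) for label, desc in categories.items()}
    idf = idf_table(list(cat_tokens.values()))
    doc_vec = tfidf(tokenize(document), idf)
    scores = {
        label: cosine(doc_vec, tfidf(tokens, idf))
        for label, tokens in cat_tokens.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_categories(
    "patient admitted with acute myocardial infarction", CATEGORIES
)
```

No labeled examples are needed here because the only inputs are the category descriptions and the document text. Comparing such a ranked list against the specialists' codes is what makes a precision-based evaluation possible.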
