An Experiment in Automatic Classification of Pathological Reports

Medical reports are predominantly written in natural language and are therefore not directly accessible to computers. A common way to make medical narrative accessible to automated systems is to assign "computer-understandable" keywords from a controlled vocabulary, a task that experts usually perform by hand. In this paper, we investigate methods to support or automate this type of medical classification. We report on experiments using the PALGA data set, a collection of 14 million pathological reports, each of which has been classified by a domain expert. We describe methods for accurately and automatically categorizing the documents in this data set. To evaluate the proposed automatic classification approaches, we compare their output with that of two additional human annotators. While the automatic system performs well in comparison with humans, inconsistencies within the annotated data constrain the maximum attainable performance.
