Automatic classification of histopathological diagnoses for building a large scale tissue catalogue

In this paper, we present an automatic classification system for pathological findings. The starting point of our undertaking was a pathological tissue collection of about 1.4 million tissue samples, described by free-text records collected over 23 years. Extracting knowledge from this "big data" pool is a challenging task, especially when the data are unstructured and span many years. The classification is based on ontology-based term extraction and on decision trees built with a manually curated classification system. The information extraction system is based on regular expressions and a text substitution system. We describe how medical experts generate the decision trees using a visual editor, and we describe the evaluation of the classification process against a reference data set. We achieved an F-score of 89.7% for ICD-10 classification and an F-score of 94.7% for ICD-O classification. For the information extraction of tumor staging and receptors, we achieved F-scores ranging from 81.8% to 96.8%.
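To illustrate how text substitution followed by regular-expression matching can pull tumor staging out of a free-text record, and how an F-score is computed from match counts, here is a minimal sketch in Python. It is not the system described above: the substitution rules, the TNM pattern, the sample sentence, and the names `SUBSTITUTIONS`, `TNM_PATTERN`, `normalise`, `extract_tnm`, and `f_score` are all assumptions made for illustration.

```python
import re

# Hypothetical substitution rules, applied in order before matching.
SUBSTITUTIONS = [
    (re.compile(r"\s+"), " "),                        # collapse runs of whitespace
    (re.compile(r"\b(p?[TNM])\s+(\d|x)"), r"\1\2"),   # "pT 2" -> "pT2"
]

# Hypothetical pattern for a simple TNM staging expression such as "pT2 pN1 M0".
TNM_PATTERN = re.compile(
    r"\bp?T(?P<t>[0-4x])\s*,?\s*p?N(?P<n>[0-3x])(?:\s*,?\s*p?M(?P<m>[01x]))?",
    re.IGNORECASE,
)


def normalise(text):
    """Apply every substitution rule to the raw record text."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text.strip()


def extract_tnm(text):
    """Return the first TNM staging found as a dict, or None if absent."""
    match = TNM_PATTERN.search(normalise(text))
    if match is None:
        return None
    return {"T": match.group("t"), "N": match.group("n"), "M": match.group("m")}


def f_score(true_pos, false_pos, false_neg):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    record = "Invasive ductal carcinoma,  pT 2 pN 1  M0."
    print(extract_tnm(record))
    print(f_score(8, 2, 2))
```

Running substitution before matching keeps the extraction pattern short, because spelling and spacing variants are handled once in the rule list rather than in every regular expression.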
