Classification of forensic autopsy reports through conceptual graph-based document representation model

Text categorization has been used extensively in recent years to classify plain-text clinical reports. This study employs text categorization techniques to classify open-narrative forensic autopsy reports. One of the key steps in text classification is document representation, in which a clinical report is transformed into a format suitable for classification. The traditional document representation technique for text categorization is the bag-of-words (BoW) technique. In this study, the traditional BoW technique is ineffective in classifying forensic autopsy reports because it merely extracts features that are frequent but not necessarily discriminative. Moreover, this technique fails to capture word inversion, as well as word-level synonymy and polysemy. Hence, the BoW technique suffers from low accuracy and low robustness unless it is improved with contextual and application-specific information. To overcome these limitations, this research aims to develop an effective conceptual graph-based document representation (CGDR) technique to classify 1500 forensic autopsy reports from four (4) manners of death (MoD) and sixteen (16) causes of death (CoD). Term-based and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) based conceptual features were extracted and represented through graphs. These features were then used to train a two-level text classifier: the first-level classifier predicted MoD, and the second-level classifier predicted CoD using the proposed CGDR technique. To demonstrate the significance of the proposed technique, its results were compared with those of six (6) state-of-the-art document representation techniques. Lastly, this study compared the effects of one-level and two-level classification on the experimental results.
The experimental results indicated that the CGDR technique achieved a 12% to 15% improvement in accuracy compared with fully automated document representation baseline techniques. Moreover, two-level classification obtained better results than one-level classification. These promising results suggest that pathologists can adopt the proposed system as a basis for a second opinion, thereby supporting them in determining CoD effectively.
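
The two-level arrangement described above (predict MoD first, then predict CoD) can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the study: a plain bag-of-words Naive Bayes model stands in for the CGDR features and for whatever classifiers the authors used, the second level is routed through one model per predicted MoD, the four toy reports and their labels are invented, and the names `NaiveBayes` and `TwoLevelClassifier` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenization: a bag-of-words stand-in, not CGDR.
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(doc):
                if tok in self.vocab:  # unseen words are ignored
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoLevelClassifier:
    """Level 1 predicts manner of death (MoD); level 2 predicts cause of
    death (CoD) with a model trained only on reports of that MoD."""

    def fit(self, docs, mods, cods):
        self.mod_model = NaiveBayes().fit(docs, mods)
        self.cod_models = {}
        for mod in set(mods):
            subset = [(d, c) for d, m, c in zip(docs, mods, cods) if m == mod]
            sub_docs, sub_cods = zip(*subset)
            self.cod_models[mod] = NaiveBayes().fit(sub_docs, sub_cods)
        return self

    def predict(self, doc):
        mod = self.mod_model.predict(doc)
        return mod, self.cod_models[mod].predict(doc)


# Invented toy reports; real autopsy narratives are far longer.
reports = [
    "blunt head injury after road traffic collision",
    "drowning in river water in lungs",
    "stab wound to chest inflicted by assailant",
    "gunshot wound to head inflicted by assailant",
]
mods = ["accident", "accident", "homicide", "homicide"]
cods = ["head injury", "drowning", "stab wound", "gunshot wound"]

model = TwoLevelClassifier().fit(reports, mods, cods)
print(model.predict("water in lungs after fall into river"))
```

A one-level alternative would train a single classifier directly on the CoD labels. In this sketch, splitting the task means each second-level model only has to separate the causes that occur under one manner of death.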
