Linguistic and Statistically Derived Features for Cause of Death Prediction from Verbal Autopsy Text

Automatic Text Classification (ATC) is an emerging technology of growing economic importance, given the unprecedented growth of text data. This paper reports on work in progress to develop methods for predicting Cause of Death from Verbal Autopsy (VA) documents, which the World Health Organisation recommends for use in low-income countries. VA documents contain both coded data and open narrative. We formulate the task as a text classification problem and explore various combinations of linguistic and statistical approaches to determine how they may improve on the standard bag-of-words approach, using a dataset of over 6400 VA documents that were manually annotated with cause of death. We demonstrate that a significant improvement in prediction accuracy can be obtained through a novel combination of statistical and linguistic features derived from the VA text. The paper explores the methods by which ATC may lead to improved accuracy in Cause of Death prediction.
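
As a rough illustration of what extending a bag-of-words representation with additional text-derived features can look like, the minimal sketch below builds unigram counts from a narrative and adds adjacent word-pair counts to the same feature dictionary. The function name, the `bg=` key prefix, and the sample narrative are hypothetical and chosen only for illustration; the abstract does not specify the actual feature set, so this should not be read as the method used in the paper.

```python
from collections import Counter


def extract_features(narrative):
    """Return unigram (bag-of-words) counts plus adjacent word-pair counts.

    Hypothetical helper for illustration; not the paper's feature set.
    """
    tokens = narrative.lower().split()
    features = Counter(tokens)  # bag-of-words baseline
    # Adjacent word pairs stand in for collocation-style features.
    for first, second in zip(tokens, tokens[1:]):
        features["bg=" + first + "_" + second] += 1
    return features


# Invented example narrative, not drawn from the VA dataset.
features = extract_features("fever and cough then high fever")
```

The resulting dictionary could be passed to any classifier that accepts sparse count features; which pairs are kept, and how they are weighted, would be a separate design decision.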
