Semi-automated categorization of open-ended questions

Text data from open-ended survey questions are difficult to analyze and are frequently ignored. Yet open-ended questions are important because they do not constrain respondents’ answer choices. Where open-ended questions are necessary, answers are sometimes hand-coded by multiple human coders into one of several categories. At the same time, computer scientists have made impressive advances in text mining that may allow such coding to be automated. However, automated algorithms do not achieve an overall accuracy high enough to replace humans entirely. We categorize open-ended narrative responses using text mining for easy-to-categorize answers and human coders for the remainder, with expected accuracies guiding the choice of the threshold that delineates “easy” from “hard”. Employing multinomial boosting yields genuine class probabilities and thus avoids the common practice of converting machine learning “confidence scores” into pseudo-probabilities. We illustrate the approach with three examples: an open-ended question soliciting respondents’ advice to a patient in a hypothetical dilemma, a follow-up probe about respondents’ perception of disclosure/privacy risk, and a question on reasons for quitting smoking from a follow-up survey of the Ontario Smoker’s Helpline. Targeting 80% combined accuracy, we found that 54%-80% of the data in these research surveys could be categorized automatically.
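A minimal sketch of this workflow follows, assuming scikit-learn’s gradient boosting in place of the authors’ multinomial boosting implementation; for simplicity, the threshold search below targets the accuracy of the auto-coded subset alone rather than the paper’s combined (human plus machine) accuracy, and the helper name split_easy_hard is hypothetical.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    def split_easy_hard(texts, labels, target_accuracy=0.80):
        """Train a multinomial boosted classifier on open-ended answers and
        pick the smallest confidence threshold whose auto-coded subset meets
        target_accuracy; answers below the threshold go to human coders."""
        X_train, X_val, y_train, y_val = train_test_split(
            texts, np.asarray(labels), test_size=0.3, random_state=0)
        model = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2)),           # unigram/bigram features
            GradientBoostingClassifier(n_estimators=200))  # multinomial boosting
        model.fit(X_train, y_train)

        proba = model.predict_proba(X_val)      # class probabilities, not raw scores
        confidence = proba.max(axis=1)
        predicted = model.classes_[proba.argmax(axis=1)]

        # Scan thresholds from permissive to strict; return the first one whose
        # "easy" (auto-coded) subset reaches the target accuracy.
        for threshold in np.linspace(0.0, 1.0, 101):
            easy = confidence >= threshold
            if easy.any() and (predicted[easy] == y_val[easy]).mean() >= target_accuracy:
                return model, threshold
        return model, 1.0  # no threshold meets the target: code everything by hand

In use, new answers whose top predicted probability falls below the returned threshold would be routed to human coders; the split between “easy” and “hard” cases then follows directly from the target accuracy rather than from an arbitrary confidence cutoff.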
