Machines that Learn how to Code Open-Ended Survey Data

Over the last seven years we have carried out experimental research aimed at developing software that automatically codes open-ended survey responses. This research has produced an industrial-strength software package that is now in operation at the Customer Insight division of a large international banking group and is integrated into a widely used software platform for the management of open-ended survey data. The software, which can code tens of thousands of open-ended responses per hour and can handle responses formulated in any of five major European languages, is the result of contributions from different fields of computer science, including Information Retrieval, Machine Learning, Computational Linguistics, and Opinion Mining. Our approach is based on a learning metaphor: automated verbatim coders are generated by a general-purpose process that learns, from a user-provided sample of manually coded verbatims, the characteristics that new, uncoded verbatims should have in order to be assigned the codes in the codeframe. In this paper we discuss the basic philosophy underlying this software. In a forthcoming companion paper we present the results of experiments run on several datasets of real respondent data, in which we compare the accuracy of the software against the accuracy of human coders.
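The learning metaphor described above can be illustrated with a minimal sketch. It is not the software discussed in this paper: it assumes a simple naive Bayes learner, one binary classifier per code, and a toy English-only sample. The function names (`tokenize`, `train_coder`, `code_verbatim`), the codes `PRICE` and `SERVICE`, and the sample verbatims are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the verbatim and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_coder(coded_sample, codeframe):
    """Learn one binary classifier per code from (verbatim, codes) pairs.

    Assumption: a multinomial naive Bayes model stands in for whatever
    learner a production system would use.
    """
    vocab = {tok for text, _ in coded_sample for tok in tokenize(text)}
    model = {}
    for code in codeframe:
        counts = {True: Counter(), False: Counter()}  # token counts per class
        docs = {True: 0, False: 0}                    # verbatim counts per class
        for text, codes in coded_sample:
            label = code in codes
            docs[label] += 1
            counts[label].update(tokenize(text))
        model[code] = (counts, docs)
    return vocab, model


def code_verbatim(text, vocab, model):
    """Return the set of codes whose classifier accepts the verbatim."""
    tokens = [t for t in tokenize(text) if t in vocab]
    assigned = set()
    for code, (counts, docs) in model.items():
        score = {}
        for label in (True, False):
            total = sum(counts[label].values())
            # Log prior and log likelihoods, both with add-one smoothing.
            s = math.log((docs[label] + 1) / (docs[True] + docs[False] + 2))
            for t in tokens:
                s += math.log((counts[label][t] + 1) / (total + len(vocab)))
            score[label] = s
        if score[True] > score[False]:
            assigned.add(code)
    return assigned


# Hypothetical manually coded sample; a verbatim may carry zero or more codes.
sample = [
    ("the fees are too high", {"PRICE"}),
    ("monthly fees are expensive", {"PRICE"}),
    ("staff were rude and slow", {"SERVICE"}),
    ("friendly staff but high fees", {"PRICE", "SERVICE"}),
    ("the staff helped me quickly", {"SERVICE"}),
    ("online banking is easy to use", set()),
]
vocab, model = train_coder(sample, ["PRICE", "SERVICE"])
print(code_verbatim("fees are very high", vocab, model))
print(code_verbatim("the staff were rude", vocab, model))
```

Training one independent yes/no classifier per code lets a single verbatim receive several codes, or none, which matches how codeframes are typically applied. A realistic system would need far more training data and language-specific preprocessing than this sketch assumes.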
