Improving the Accuracy of Automated Occupation Coding at Any Production Rate

Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose two new methods for automatic coding: a hybrid method that combines a rule-based approach based on duplicates with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that both methods improve on the coding accuracy of the underlying statistical learning algorithm and, where duplicates exist, on the coding accuracy of duplicates. We also find that statistical learning is improved by combining separate models for the detailed occupation codes and for aggregate occupation codes. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
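To make the distinction between exact-string duplicates and n-gram duplicates concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the n-gram variables are word unigrams of the lower-cased answer, that a duplicate is coded by majority vote among matching training answers, and that answers without a duplicate are passed to a statistical learning model. The names `ngram_key`, `build_duplicate_index`, `hybrid_code`, and `fallback`, as well as the placeholder codes, are hypothetical.

```python
import re
from collections import Counter, defaultdict


def ngram_key(text, n=1):
    """Map an answer to its set of word n-grams (assumed n-gram variables)."""
    tokens = re.findall(r"\w+", text.lower())
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return frozenset(grams)


def exact_key(text):
    """Baseline: two answers are duplicates only if the strings are identical."""
    return text


def build_duplicate_index(answers, codes, key=ngram_key):
    """Count, for every duplicate key, how often each code was assigned."""
    index = defaultdict(Counter)
    for answer, code in zip(answers, codes):
        index[key(answer)][code] += 1
    return index


def hybrid_code(answer, index, fallback, key=ngram_key):
    """Use the majority code of duplicates if any exist, else the model."""
    votes = index.get(key(answer))
    if votes:
        return votes.most_common(1)[0][0], "duplicate"
    return fallback(answer), "model"


# Toy training data with placeholder codes (not real occupation codes).
train_answers = ["registered nurse", "Registered Nurse", "truck driver"]
train_codes = ["code_A", "code_A", "code_B"]


def fallback(answer):
    """Hypothetical stand-in for a trained statistical learning model."""
    return "code_X"


ngram_index = build_duplicate_index(train_answers, train_codes, key=ngram_key)
exact_index = build_duplicate_index(train_answers, train_codes, key=exact_key)

new_answer = "Nurse, registered"
ngram_result = hybrid_code(new_answer, ngram_index, fallback, key=ngram_key)
exact_result = hybrid_code(new_answer, exact_index, fallback, key=exact_key)
```

In this toy example the answer "Nurse, registered" has no exact-string duplicate in the training data and is therefore sent to the fallback model, whereas the unigram-based key ignores case, punctuation, and word order and finds two duplicates. This only illustrates why an n-gram definition can yield more duplicates; it says nothing about the accuracy figures reported for ALLBUS.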
