Using behavioral data to identify interviewer fabrication in surveys

Surveys conducted by human interviewers are one of the principal means of gathering data from all over the world, but the quality of this data can be threatened by interviewer fabrication. In this paper, we investigate a new approach to detecting interviewer fabrication automatically. We instrument electronic data collection software to record logs of low-level behavioral data and show that supervised classification, when applied to features extracted from these logs, can identify interviewer fabrication with an accuracy of up to 96%. We show that even when interviewers know that our approach is being used, have some knowledge of how it works, and are incentivized to avoid detection, it can still achieve an accuracy of 86%. We also demonstrate the robustness of our approach to a moderate amount of label noise and provide practical recommendations, based on empirical evidence, on how much data is needed for our approach to be effective.
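As an illustration only, the pipeline described above (log low-level behavior, extract features, apply supervised classification) might be sketched as follows. Everything concrete here is an assumption made for the sketch and does not come from the paper: the log format, the event names (`answer`, `edit`, `back`), the four features, and the toy logs are all hypothetical. A simple nearest-centroid classifier stands in for whichever classifier the authors actually used, so that the example needs only the standard library.

```python
from statistics import mean, median

# Hypothetical log format: each interview is a non-empty list of
# (timestamp_in_seconds, event_type) tuples. The event names are
# illustrative assumptions, not the paper's logging schema.

def extract_features(log):
    """Turn one interview's event log into a fixed-length feature vector."""
    times = [t for t, _ in log]
    answer_times = [t for t, kind in log if kind == "answer"]
    gaps = [b - a for a, b in zip(answer_times, answer_times[1:])]
    return [
        times[-1] - times[0],                         # total duration
        median(gaps) if gaps else 0.0,                # median seconds between answers
        sum(1 for _, kind in log if kind == "edit"),  # number of answer edits
        sum(1 for _, kind in log if kind == "back"),  # number of backward navigations
    ]

def fit_centroids(vectors, labels):
    """Supervised step: average the feature vectors of each labeled class."""
    centroids = {}
    for label in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, vector):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, vector))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Synthetic toy logs, invented purely to exercise the code.
real_1 = [(0, "answer"), (20, "answer"), (45, "edit"),
          (60, "answer"), (90, "back"), (120, "answer")]
real_2 = [(0, "answer"), (30, "answer"), (50, "edit"),
          (70, "answer"), (110, "answer")]
fake_1 = [(0, "answer"), (3, "answer"), (6, "answer"), (9, "answer")]
fake_2 = [(0, "answer"), (4, "answer"), (7, "answer"), (12, "answer")]

train_logs = [real_1, real_2, fake_1, fake_2]
train_labels = ["real", "real", "fabricated", "fabricated"]

centroids = fit_centroids([extract_features(log) for log in train_logs],
                          train_labels)

# An unlabeled interview that was completed very quickly with no edits.
new_log = [(0, "answer"), (5, "answer"), (9, "answer"), (15, "answer")]
prediction = predict(centroids, extract_features(new_log))
print(prediction)
```

In practice the features would need scaling, and a stronger classifier evaluated on held-out labeled interviews would replace the centroid rule. The sketch is only meant to show the shape of the log-to-features-to-label flow.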
