Algorithmic Approaches to Detecting Interviewer Fabrication in Surveys

Surveys are one of the principal means of gathering critical data from low-income regions. Bad data, however, may be no better than no data at all, and can be worse. Interviewer data fabrication, one cause of bad data, is an ongoing concern for survey organizations and a constant threat to data quality. In my dissertation work, I build software that automatically identifies interviewer fabrication so that supervisors can act to reduce it. To do so, I draw on two tool sets from computer science, one algorithmic and the other technological. On the algorithmic side, I use two families of machine learning techniques, supervised classification and anomaly detection, to flag likely fabrication. On the technological side, I modify data collection software running on mobile electronic devices to record user traces that can help identify fabrication. I show, based on the results of two empirical studies, that combining these approaches makes it possible to identify interviewer fabrication accurately and robustly, even when interviewers are aware that the algorithms are being used, have some knowledge of how they work, and are incentivized to avoid detection.
