Extracting information from big data: Issues of measurement, inference and linkage

Introduction

Big data pose several new and interesting challenges to statisticians and others who want to extract information from data. As Groves pointedly commented, the era is “appropriately called Big Data as opposed to Big Information,” because analysts have a lot of work to do before information can be gained from “auxiliary traces of some process that is going on in the society.” The analytic challenges most often discussed relate to three of the Vs used to characterize big data. The volume of truly massive data requires processing techniques that match modern hardware infrastructure, cloud computing with appropriate optimization mechanisms, and re-engineered storage systems. The velocity of the data calls for algorithms that can learn and update continuously, and for the computing infrastructure to support them. Finally, the variety of the data structures requires statistical methods that more easily combine different data types collected at different levels, sometimes with a temporal and geographic structure.

However, when it comes to privacy and confidentiality, the challenges of extracting (meaningful) information from big data are, in our view, similar to those associated with much smaller data, surveys being one example. For any statistician or quantitatively working (social) scientist, there are two main concerns when extracting information from data, which we summarize here as concerns about measurement and concerns about inference. Both can be affected by privacy and confidentiality concerns.
