Causal Inference Struggles with Agency on Online Platforms

Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest, as an alternative to experiments in which the platform controls whether a user receives the treatment. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed by how well they replicate the results of a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. Across all four comparisons, the observational estimates perform poorly at recovering the ground-truth estimates from the analogous randomized experiments, and in all but one case they have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate “Catch-22”s suggesting that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.
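As a rough illustration of the four observational estimators named above, the following is a minimal, hypothetical Python sketch, not taken from the paper's analysis. It applies the naive difference in group means, exact matching, regression adjustment, and inverse probability of treatment weighting (IPTW) to a synthetic dataset with a single, fully observed confounder; all variable names and the data-generating process are assumptions made for the example.

```python
# Hypothetical sketch (not the authors' code): the four observational estimators
# from the abstract, applied to synthetic data with one known confounder.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: one binary confounder drives both self-selection and the outcome.
confounder = rng.binomial(1, 0.4, n)            # e.g., an illustrative "power user" flag
p_treat = 0.2 + 0.5 * confounder                # users self-select into treatment
treated = rng.binomial(1, p_treat)
outcome = 1.0 * treated + 2.0 * confounder + rng.normal(0, 1, n)  # true effect = 1.0

df = pd.DataFrame({"treated": treated, "outcome": outcome, "confounder": confounder})

# 1) Naive difference in group means (ignores confounding).
naive = (df.loc[df.treated == 1, "outcome"].mean()
         - df.loc[df.treated == 0, "outcome"].mean())

# 2) Exact matching on the confounder: within-stratum differences,
#    weighted by the number of treated units in each stratum.
strata = df.groupby("confounder").apply(
    lambda g: pd.Series({
        "diff": g.loc[g.treated == 1, "outcome"].mean()
                - g.loc[g.treated == 0, "outcome"].mean(),
        "n_treated": (g.treated == 1).sum(),
    })
)
exact_match = np.average(strata["diff"], weights=strata["n_treated"])

# 3) Regression adjustment: coefficient on treatment, controlling for the confounder.
reg = LinearRegression().fit(df[["treated", "confounder"]], df["outcome"])
reg_adjust = reg.coef_[0]

# 4) IPTW: weight each unit by the inverse of its estimated propensity score.
ps = (LogisticRegression()
      .fit(df[["confounder"]], df["treated"])
      .predict_proba(df[["confounder"]])[:, 1])
w = np.where(df.treated == 1, 1 / ps, 1 / (1 - ps))
iptw = (np.average(df.outcome[df.treated == 1], weights=w[df.treated == 1])
        - np.average(df.outcome[df.treated == 0], weights=w[df.treated == 0]))

print(f"naive={naive:.2f}  exact_match={exact_match:.2f}  "
      f"regression={reg_adjust:.2f}  iptw={iptw:.2f}  (true effect = 1.00)")
```

In this toy setting the three adjusted estimators recover the true effect because the single confounder is fully observed; the paper's within-study comparisons probe whether such adjustments succeed when the self-selection mechanism on a real platform is not captured so cleanly.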
