Online randomized controlled experiments at scale: lessons and extensions to medicine

Background: Many technology companies, including Airbnb, Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yahoo!/Oath, run online randomized controlled experiments at scale, namely hundreds of concurrent controlled experiments on millions of users each, commonly referred to as A/B tests. Although they derive from the same statistical roots, randomized controlled trials (RCTs) in medicine are now criticized for being expensive and difficult, while in technology the marginal cost of such experiments is approaching zero and their value for data-driven decision-making is broadly recognized.

Methods and results: This is an overview of key scaling lessons learned in the technology field. They include (1) a focus on metrics: an overall evaluation criterion plus thousands of metrics for insights and debugging, automatically computed for every experiment; (2) quick release cycles with automated ramp-up and shut-down that afford agile and safe experimentation, leading to consistent incremental progress over time; and (3) a culture of ‘test everything’, because most ideas fail and tiny changes sometimes show surprising outcomes worth millions of dollars annually. Technological advances, online interactions, and the availability of large-scale data allowed technology companies to take the science of RCTs and apply it as online randomized controlled experiments at large scale, with hundreds of concurrent experiments running on any given day across a wide range of software products, be they web sites, mobile applications, or desktop applications. Rather than hindering innovation, these experiments accelerated it, with clear improvements to key metrics, including user experience and revenue. As healthcare increasingly interacts with patients through these modern channels of web sites and digital health applications, many of the lessons apply. The most innovative technological field has recognized that a systematic series of randomized trials, in which many of the most promising ideas fail, leads to sustainable improvement.

Conclusion: While there are many differences between technology and medicine, it is worth considering whether and how similar designs can be applied via simple RCTs that focus on healthcare decision-making or service delivery. Changes, small and large, should undergo continuous and repeated evaluation in randomized trials, and learning from their results will enable accelerated healthcare improvements.
