Paper information - Online randomized controlled experiments at scale: lessons and extensions to medicine

Background: Many technology companies, including Airbnb, Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yahoo!/Oath, run online randomized controlled experiments at scale, namely hundreds of concurrent controlled experiments on millions of users each, commonly referred to as A/B tests. Originally derived from the same statistical roots, randomized controlled trials (RCTs) in medicine are now criticized for being expensive and difficult, while in technology, the marginal cost of such experiments is approaching zero and the value for data-driven decision-making is broadly recognized.

Methods and results: This is an overview of key scaling lessons learned in the technology field. They include (1) a focus on metrics, an overall evaluation criterion and thousands of metrics for insights and debugging, automatically computed for every experiment; (2) quick release cycles with automated ramp-up and shut-down that afford agile and safe experimentation, leading to consistent incremental progress over time; and (3) a culture of ‘test everything’ because most ideas fail and tiny changes sometimes show surprising outcomes worth millions of dollars annually. Technological advances, online interactions, and the availability of large-scale data allowed technology companies to take the science of RCTs and use them as online randomized controlled experiments at large scale with hundreds of such concurrent experiments running on any given day on a wide range of software products, be they web sites, mobile applications, or desktop applications. Rather than hindering innovation, these experiments enabled accelerated innovation with clear improvements to key metrics, including user experience and revenue. As healthcare increases interactions with patients utilizing these modern channels of web sites and digital health applications, many of the lessons apply. The most innovative technological field has recognized that systematic series of randomized trials with numerous failures of the most promising ideas leads to sustainable improvement.

Conclusion: While there are many differences between technology and medicine, it is worth considering whether and how similar designs can be applied via simple RCTs that focus on healthcare decision-making or service delivery. Changes, small and large, should undergo continuous and repeated evaluations in randomized trials, and learning from their results will enable accelerated healthcare improvements.
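As a rough illustration of lesson (2), automated shut-down driven by a per-experiment metric, the following minimal sketch compares a binary metric (for example, conversion) between control and treatment with a two-proportion z-test. The function names, the `alpha` threshold, and the decision labels are assumptions made for this example; the abstract does not describe any specific implementation, and production platforms use considerably more elaborate guardrails.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare treatment (B) against control (A) on a binary metric.

    Returns (difference in rates, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Degenerate case: every user (or no user) converted in both arms.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


def ramp_decision(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Hypothetical guardrail: stop if treatment is significantly worse."""
    diff, _, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    if p_value < alpha and diff < 0:
        return "shut_down"
    return "continue"


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    print(ramp_decision(500, 10_000, 300, 10_000))
    print(ramp_decision(100, 1_000, 100, 1_000))
```

In this sketch a treatment is stopped only when it is significantly worse than control; a neutral or positive result simply lets the ramp-up continue.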

Ron Kohavi | J. Ioannidis | Diane Tang | Ya Xu | L. Hemkens

[1] R. Peto, et al. Why do we need some large, simple randomized trials? Statistics in Medicine, 1984.

[2] G. Belle. Statistical rules of thumb. 2002.

[3] Hilde van der Togt, et al. Publisher's Note. J. Netw. Comput. Appl., 2003.

[4] J. Ioannidis, et al. Nested Randomized Trials in Large Cohorts and Biobanks: Studying the Health Effects of Lifestyle Factors. Epidemiology, 2008.

[5] Ashish Agarwal, et al. Overlapping experiment infrastructure: more, better, faster experimentation. KDD, 2010.

[6] Eric Ries. The lean startup: how today's entrepreneurs use continuous innovation to create radically successful businesses. 2011.

[7] T. Peters, et al. Reporting of factorial trials of complex interventions in community settings: a systematic review. Trials, 2011.

[8] J. Ioannidis, et al. Risk factors and interventions with statistically significant tiny effects. International Journal of Epidemiology, 2011.

[9] Ron Kohavi, et al. Trustworthy online controlled experiments: five puzzling outcomes explained. KDD, 2012.

[10] J. Ioannidis, et al. Concordance of effects of medical interventions on hospital admission and readmission rates with effects on mortality. Canadian Medical Association Journal, 2013.

[11] Ron Kohavi, et al. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. WSDM, 2013.

[12] Ron Kohavi, et al. Online controlled experiments at large scale. KDD, 2013.

[13] Michael Hay, et al. Clinical development success rates for investigational drugs. Nature Biotechnology, 2014.

[14] Ron Kohavi, et al. Seven rules of thumb for web site experimenters. KDD, 2014.

[15] J. Carlin, et al. Beyond Power Calculations. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 2014.

[16] Michael S. Bernstein, et al. Designing and deploying online field experiments. WWW, 2014.

[17] Diane Tang, et al. Focusing on the Long-term: It's Good for Users and Business. KDD, 2015.

[18] Tze Leung Lai, et al. Innovative designs of point-of-care comparative effectiveness trials. Contemporary Clinical Trials, 2015.

[19] Diane Tang, et al. Focus on the Long-Term: It's better for Users and Business. 2015.

[20] Sarah M. Greene, et al. Oversight on the borderline: Quality improvement and pragmatic research. Clinical Trials, 2015.

[21] Anmol Bhasin, et al. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. KDD, 2015.

[22] Trudie Lang, et al. Making randomised trials more efficient: report of the first meeting to discuss the Trial Forge platform. Trials, 2015.

[23] Gareth Ambler, et al. Are multiple primary outcomes analysed appropriately in randomised controlled trials? A review. Contemporary Clinical Trials, 2015.

[24] Huizhi Xie, et al. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. KDD, 2016.

[25] Jason P. Fine, et al. Statistical Primer for Cardiovascular Research: Introduction to the Analysis of Survival Data in the Presence of Competing Risks. 2022.

[26] Ya Xu, et al. Evaluating Mobile Apps with A/B and Quasi A/B Tests. KDD, 2016.

[27] D. Messner, et al. Framing the conversation: use of PRECIS-2 ratings to advance understanding of pragmatic trial design domains. Trials, 2017.