Sayer: Using Implicit Feedback to Optimize System Policies

We observe that many system policies that make threshold decisions involving a resource (e.g., time, memory, cores) naturally reveal additional, implicit feedback. For example, if a system waits X minutes for an event to occur, it automatically learns what would have happened had it waited less than X minutes, because time accumulates. This feedback describes alternative decisions and can be used to improve the system policy. However, leveraging implicit feedback is difficult because it tends to be one-sided or incomplete, and may depend on the outcome of the event. As a result, existing practices for using feedback, such as simply incorporating it into a data-driven model, suffer from bias. We develop a methodology, called Sayer, that leverages implicit feedback to evaluate and train new system policies. Sayer builds on two ideas from reinforcement learning, randomized exploration and unbiased counterfactual estimators, to use data collected by an existing policy to estimate the performance of new candidate policies without actually deploying them. Sayer uses implicit exploration and implicit data augmentation to generate implicit feedback in an unbiased form, which an implicit counterfactual estimator then uses to evaluate and train new policies. The key idea underlying these techniques is to assign implicit probabilities to decisions that were not actually taken but whose feedback can be inferred; these probabilities are carefully calculated to ensure statistical unbiasedness. We apply Sayer to two production scenarios in Azure and show that it can evaluate arbitrary policies accurately and train new policies that outperform the production policies.
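The timeout example above can be made concrete with a toy simulation. The sketch below is illustrative only: the reward model, the uniform logging policy, and all names are assumptions, not Sayer's actual implementation. It shows the two ingredients the abstract describes: (1) one logged timeout decision reveals the rewards of other, untaken timeouts (every shorter timeout, and, if the event was observed, every longer one too), and (2) each inferred reward is weighted by the implicit probability that the logging policy would have revealed it, which keeps the counterfactual estimate statistically unbiased.

```python
import random

K = 8                                                # candidate timeouts: 1..K minutes (assumed)
LOG_PROBS = {a: 1.0 / K for a in range(1, K + 1)}    # assumed uniform randomized logging policy

def reward(timeout, event_time):
    """Assumed reward model: success yields 10 minus the minutes waited;
    a timeout wastes the full threshold. Purely illustrative."""
    return 10.0 - event_time if event_time <= timeout else -float(timeout)

def tail_prob(t):
    """P(logged timeout >= t) under the logging policy."""
    return sum(p for a, p in LOG_PROBS.items() if a >= t)

def implicit_value_estimate(logs, target_policy):
    """Implicit counterfactual estimator: every timeout whose feedback can be
    inferred from a log entry contributes, weighted by the implicit
    probability that its feedback would have been revealed."""
    total = 0.0
    for taken, event_time in logs:
        for a in range(1, K + 1):
            # Feedback for timeout a is revealed iff taken >= min(a, event_time):
            # either we waited past the event (and learned event_time exactly),
            # or we waited at least a minutes without seeing it (so a times out).
            # In the second case reward(a, .) = -a, so only revealed information
            # is ever used, even though the simulation passes event_time around.
            if taken >= min(a, event_time):
                p_implicit = tail_prob(min(a, event_time))
                total += target_policy[a] * reward(a, event_time) / p_implicit
    return total / len(logs)

random.seed(0)
logs = []
for _ in range(200_000):
    event_time = random.randint(1, 12)   # latent event time in minutes (assumed distribution)
    taken = random.choices(list(LOG_PROBS), weights=list(LOG_PROBS.values()))[0]
    logs.append((taken, event_time))

# Candidate policy to evaluate offline: always wait 5 minutes.
target = {a: (1.0 if a == 5 else 0.0) for a in range(1, K + 1)}
truth = sum(reward(5, t) for t in range(1, 13)) / 12.0
print(f"estimate={implicit_value_estimate(logs, target):.3f}  truth={truth:.3f}")
```

Note the outcome dependence the abstract mentions: when the event occurs before the logged timeout, rewards for all K timeouts are inferred; when it does not, only shorter timeouts are revealed. Naively averaging this one-sided feedback would overweight short waits; dividing by the implicit revelation probability corrects exactly that bias.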
