AutoSys: The Design and Operation of Learning-Augmented Systems

Although machine learning (ML) and deep learning (DL) open new possibilities for optimizing system design and performance, taking advantage of this paradigm shift requires more than implementing existing ML/DL algorithms. This paper reports our years of experience in designing and operating several production learning-augmented systems at Microsoft. AutoSys is a framework that unifies the development process and addresses common design considerations, including ad-hoc and nondeterministic jobs, learning-induced system failures, and programming extensibility. Furthermore, this paper demonstrates the benefits of adopting AutoSys with measurements from one production system, Web Search. Finally, we share long-term lessons stemming from unforeseen implications that have surfaced over years of operating learning-augmented systems.
