BARISTA: Efficient and Scalable Serverless Serving System for Deep Learning Prediction Services

Pre-trained deep learning models are increasingly used to offer compute-intensive predictive analytics services such as fitness tracking and speech and image recognition. Because these models are stateless and highly parallelizable, they are well suited to the serverless computing paradigm. However, making effective resource management decisions for these services is hard: workloads are dynamic, and the available resource configurations are diverse and differ in their deployment and management costs. To address these challenges, we present Barista, a distributed and scalable deep-learning prediction serving system, and make the following contributions. First, we present a fast and effective methodology for forecasting workloads by identifying various trends. Second, we formulate an optimization problem to minimize the total cost incurred while ensuring bounded prediction latency with reasonable accuracy. Third, we propose an efficient heuristic to identify suitable compute resource configurations. Fourth, we propose an intelligent agent that allocates and manages compute resources through horizontal and vertical scaling to maintain the required prediction latency. Finally, using representative real-world workloads for an urban transportation service, we demonstrate and validate the capabilities of Barista.
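The cost-versus-latency trade-off described above can be illustrated with a minimal sketch. It is not Barista's actual formulation or heuristic, which the abstract does not spell out. It assumes that each configuration has a profiled per-request latency, that one instance handles requests one at a time, and that a forecast request rate is given. The names `Config` and `cheapest_plan` and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical resource configuration (not taken from the paper)."""
    name: str
    cost_per_hour: float  # price of one instance of this configuration
    latency_ms: float     # assumed profiled latency of one prediction


def cheapest_plan(configs, forecast_rps, latency_bound_ms):
    """Return (config, instance_count, hourly_cost) with the lowest cost
    among configurations that satisfy the latency bound, or None."""
    best = None
    for c in configs:
        # Constraint: a single prediction must finish within the bound.
        if c.latency_ms > latency_bound_ms:
            continue
        # Assumption: one instance serves requests sequentially, so it
        # sustains 1000 / latency_ms requests per second.
        instances = max(1, math.ceil(forecast_rps * c.latency_ms / 1000.0))
        total = instances * c.cost_per_hour
        if best is None or total < best[2]:
            best = (c, instances, total)
    return best


if __name__ == "__main__":
    candidates = [
        Config("small", 0.10, 400.0),
        Config("medium", 0.20, 150.0),
        Config("large", 0.45, 60.0),
    ]
    plan = cheapest_plan(candidates, forecast_rps=40, latency_bound_ms=200)
    print(plan)
```

In this toy setting the choice of configuration type stands in for vertical scaling and the instance count for horizontal scaling; a real system would also have to account for deployment and management costs, which this sketch ignores.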
