INFaaS: Automated Model-less Inference Serving

Despite existing work in machine learning inference serving, ease-of-use and cost efficiency remain challenges at large scales. Developers must manually search through thousands of model-variants (versions of already-trained models that differ in hardware, resource footprints, latencies, costs, and accuracies) to meet the diverse requirements of their applications. Since requirements, query load, and the applications themselves evolve over time, these decisions must be made dynamically for each inference query; otherwise, naive autoscaling of a fixed choice incurs excessive costs. To avoid navigating the large and complex trade-off space of model-variants, developers often fix a single variant across all queries and replicate it when load increases. However, given the diversity across variants and hardware platforms in the cloud, this lack of engagement with the trade-off space can impose significant costs on developers. This paper introduces INFaaS, an automated model-less system for distributed inference serving, where developers simply state the performance and accuracy requirements of their applications without selecting a specific model-variant for each query. INFaaS generates model-variants from already-trained models and efficiently navigates the large trade-off space of model-variants on behalf of developers to meet application-specific objectives: (a) for each query, it selects a model, hardware architecture, and model optimizations; (b) it combines VM-level horizontal autoscaling with model-level autoscaling, where multiple different model-variants are used to serve queries within each machine. By leveraging diverse variants and sharing hardware resources across models, INFaaS achieves 1.3× higher throughput, violates latency objectives 1.6× less often, and saves up to 21.6× in cost (8.5× on average) compared to state-of-the-art inference serving systems on AWS EC2.
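To make the model-less abstraction concrete, the sketch below shows how a developer might state per-query objectives and how a serving system could choose a variant on their behalf. This is a minimal illustration under assumed names, not INFaaS's actual interface: `QueryRequirements`, `select_variant`, and all variant names and numbers are hypothetical.

```python
# Minimal sketch of model-less variant selection (hypothetical API, not
# INFaaS's actual interface). The developer states objectives; the system
# picks, per query, the cheapest model-variant that satisfies them.
from dataclasses import dataclass

@dataclass
class QueryRequirements:
    max_latency_ms: float   # per-query latency objective
    min_accuracy: float     # e.g., top-1 accuracy on a validation set

def select_variant(variants, req):
    """Return the cheapest registered variant meeting both objectives,
    or None if no variant qualifies."""
    feasible = [v for v in variants
                if v["latency_ms"] <= req.max_latency_ms
                and v["accuracy"] >= req.min_accuracy]
    return min(feasible, key=lambda v: v["cost_per_1k"], default=None)

# Three variants generated from the same trained model, differing in
# architecture depth, hardware, and optimization (made-up numbers).
variants = [
    {"name": "resnet50-cpu",      "latency_ms": 90, "accuracy": 0.76, "cost_per_1k": 0.4},
    {"name": "resnet50-gpu-fp16", "latency_ms": 8,  "accuracy": 0.75, "cost_per_1k": 2.1},
    {"name": "resnet18-cpu",      "latency_ms": 30, "accuracy": 0.69, "cost_per_1k": 0.2},
]

req = QueryRequirements(max_latency_ms=50, min_accuracy=0.70)
print(select_variant(variants, req)["name"])  # -> resnet50-gpu-fp16
```

Because selection is re-run per query against live profiles rather than fixed once, the same mechanism can shift load to a cheaper or already-loaded variant before resorting to VM-level scaling, which is the combination point (b) of the abstract describes.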
