One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-Off in Machine Learning Cloud Service APIs via Tolerance Tiers

Today's cloud service architectures follow a “one size fits all” deployment strategy where the same service version instantiation is provided to the end users. However, consumers are broad and different applications have different accuracy and responsiveness requirements, which as we demonstrate renders the “one size fits all” approach inefficient in practice. We use a production grade speech recognition engine, which serves several thousands of users, and an open source computer vision based system, to explain our point. To overcome the limitations of the “one size fits all” approach, we recommend Tolerance Tiers where each MLaaS tier exposes an accuracy/responsiveness characteristic, and consumers can programmatically select a tier. We evaluate our proposal on the CPU-based automatic speech recognition (ASR) engine and cutting-edge neural networks for image classification deployed on both CPUs and GPUs. The results show that our proposed approach provides a MLaaS cloud service architecture that can be tuned by the end API user or consumer to outperform the conventional “one size fits all” approach.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[3]  Christina Delimitrou,et al.  Tarcil: reconciling scheduling speed and quality in large shared clusters , 2015, SoCC.

[4]  Christina Delimitrou,et al.  HCloud: Resource-Efficient Provisioning in Shared Cloud Systems , 2016, ASPLOS.

[5]  Gu-Yeon Wei,et al.  Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[6]  Song Han,et al.  Learning both Weights and Connections for Efficient Neural Network , 2015, NIPS.

[7]  Forrest N. Iandola,et al.  SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.

[8]  H. T. Kung,et al.  BranchyNet: Fast inference via early exiting from deep neural networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[9]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[10]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[11]  Tara N. Sainath,et al.  Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Michael I. Jordan,et al.  The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox , 2014, CIDR.

[13]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[14]  Scott A. Mahlke,et al.  Rumba: An online quality management system for approximate computing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[15]  Benjamin C. Lee,et al.  Market mechanisms for managing datacenters with heterogeneous microarchitectures , 2014, TOCS.

[16]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[17]  Kaushik Roy,et al.  Conditional Deep Learning for energy-efficient and enhanced pattern recognition , 2015, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[18]  Thomas F. Wenisch,et al.  The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services , 2014, OSDI.

[19]  Jose-Maria Arnau,et al.  An ultra low-power hardware accelerator for automatic speech recognition , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[21]  Martin C. Rinard,et al.  Chisel: reliability- and accuracy-aware optimization of approximate computational kernels , 2014, OOPSLA.

[22]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Alec Wolman,et al.  MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints , 2016, MobiSys.

[24]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[25]  Thomas F. Wenisch,et al.  Power management of online data-intensive services , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[26]  Kevin Skadron,et al.  Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[27]  David Hinkley,et al.  Bootstrap Methods: Another Look at the Jackknife , 2008 .

[28]  Muli Ben-Yehuda,et al.  Deconstructing Amazon EC2 Spot Instance Pricing , 2011, 2011 IEEE Third International Conference on Cloud Computing Technology and Science.

[29]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[30]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Scott A. Mahlke,et al.  Input responsiveness: using canary inputs to dynamically steer approximation , 2016, PLDI.

[32]  Lingjia Tang,et al.  Whare-map: heterogeneity in "homogeneous" warehouse-scale computers , 2013, ISCA.

[33]  Christina Delimitrou,et al.  Paragon: QoS-aware scheduling for heterogeneous datacenters , 2013, ASPLOS '13.

[34]  George Saon,et al.  The IBM 2015 English conversational telephone speech recognition system , 2015, INTERSPEECH.

[35]  Geoffrey Zweig,et al.  Parallelizing WFST speech decoders , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36]  Sungroh Yoon,et al.  Big/little deep neural network for ultra low power inference , 2015, 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[37]  Geoffrey Zweig,et al.  The microsoft 2016 conversational speech recognition system , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[38]  Hadi Esmaeilzadeh,et al.  Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[39]  Xin Wang,et al.  Clipper: A Low-Latency Online Prediction Serving System , 2016, NSDI.

[40]  Gu-Yeon Wei,et al.  Profiling a warehouse-scale computer , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[41]  Ronald G. Dreslinski,et al.  Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers , 2015, ASPLOS.

[42]  Song Han,et al.  Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.