AI Tax: The Hidden Cost of AI Data Center Applications

Artificial intelligence and machine learning are seeing widespread adoption in industry and academia. This adoption has been driven by rapid advances in the applications and accuracy of AI, achieved through increasingly complex algorithms and models, which in turn has spurred research into specialized hardware AI accelerators. Given the rapid pace of these advances, it is easy to forget that algorithms and accelerators are often developed and evaluated in a vacuum, without considering the full application environment. This paper emphasizes the need for a holistic, end-to-end analysis of AI workloads and reveals the "AI tax." We deploy and characterize Face Recognition in an edge data center. The application is an AI-centric edge video analytics application built using popular open source infrastructure and ML tools. Despite using state-of-the-art AI and ML algorithms, the application relies heavily on pre- and post-processing code. As AI-centric applications benefit from the speedups promised by accelerators, we find that they impose stresses on the hardware and software infrastructure: storage and network bandwidth become major bottlenecks as AI acceleration increases. By specializing for AI applications, we show that a purpose-built edge data center can be designed for the stresses of accelerated AI at 15% lower TCO than one derived from homogeneous servers and infrastructure.
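
The bottleneck shift described above can be illustrated with a minimal Amdahl's-law sketch. This is not the paper's model: the function names, the 60% AI fraction, and the speedup factors are hypothetical values chosen only to show how the non-AI share of end-to-end time grows as the AI kernels are accelerated.

```python
def end_to_end_speedup(ai_fraction: float, ai_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the AI portion is accelerated."""
    if not 0.0 <= ai_fraction <= 1.0:
        raise ValueError("ai_fraction must be in [0, 1]")
    if ai_speedup <= 0.0:
        raise ValueError("ai_speedup must be positive")
    return 1.0 / ((1.0 - ai_fraction) + ai_fraction / ai_speedup)


def non_ai_share(ai_fraction: float, ai_speedup: float) -> float:
    """Share of the accelerated run time spent outside the AI kernels
    (pre-/post-processing, storage, network)."""
    if not 0.0 <= ai_fraction <= 1.0:
        raise ValueError("ai_fraction must be in [0, 1]")
    if ai_speedup <= 0.0:
        raise ValueError("ai_speedup must be positive")
    non_ai = 1.0 - ai_fraction
    return non_ai / (non_ai + ai_fraction / ai_speedup)


if __name__ == "__main__":
    AI_FRACTION = 0.6  # hypothetical: 60% of baseline time is AI inference
    for speedup in (1, 2, 8, 32):  # hypothetical accelerator speedups
        total = end_to_end_speedup(AI_FRACTION, speedup)
        share = non_ai_share(AI_FRACTION, speedup)
        print(f"AI x{speedup:>2}: end-to-end x{total:.2f}, non-AI share {share:.0%}")
```

Under these assumed numbers, the end-to-end speedup is capped at 1 / (1 - 0.6) = 2.5x no matter how fast the accelerator becomes, which is one way to read the "AI tax": the supporting code and infrastructure increasingly dominate the run time.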
