Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers

As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle to addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of seven benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.
