Accelerating Mobile Audio Sensing Algorithms through On-Chip GPU Offloading

GPUs have recently gained popularity as general-purpose software accelerators in application domains including computer vision and natural language processing. However, there has been little exploration of the performance and energy trade-offs mobile GPUs can deliver for the increasingly popular workload of deep-inference audio sensing tasks, such as spoken keyword spotting, on energy-constrained smartphones and wearables. In this paper, we study these trade-offs and introduce an optimization engine that applies a series of structural and memory-access optimization techniques, allowing audio algorithm performance to be automatically tuned as a function of GPU device specifications and model semantics. We find that parameter-optimized audio routines obtain inferences an order of magnitude faster than sequential CPU implementations, and up to 6.5x faster than cloud offloading with good connectivity, while consuming 3-4x less energy than the CPU. With our optimized GPU routines, conventional wisdom about when to use the cloud and low-power chips no longer holds. Unless the network offers a throughput of at least 20 Mbps (and an RTT of 25 ms or less), the optimized GPU audio sensing apps begin to consume less energy than cloud offloading once only about 10 to 20 seconds of audio data are buffered for batched execution. Under such conditions, the optimized GPU provides energy benefits comparable to low-power reference DSP implementations with some preliminary level of optimization, while always delivering lower latency.
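
As a concrete illustration of the trade-off described above, the sketch below encodes the reported break-even conditions (20 Mbps throughput, 25 ms RTT, roughly 10 to 20 seconds of buffered audio) as a simple offloading decision rule. This is a minimal sketch assuming those measured thresholds apply directly; the type and function names are hypothetical and are not part of the paper's optimization engine.

```python
# Hypothetical sketch of the energy-driven offloading decision discussed above.
# The thresholds mirror the measurements reported in the abstract; the code
# structure itself is illustrative, not the paper's actual engine.

from dataclasses import dataclass


@dataclass
class NetworkState:
    throughput_mbps: float  # measured uplink throughput
    rtt_ms: float           # measured round-trip time


def choose_execution_target(net: NetworkState, buffered_audio_s: float) -> str:
    """Pick where to run the audio inference pipeline.

    Cloud offloading only pays off energy-wise on a fast, low-latency link;
    otherwise, batching roughly 10-20 s of audio and running the
    parameter-optimized GPU routines locally consumes less energy.
    """
    fast_link = net.throughput_mbps >= 20.0 and net.rtt_ms <= 25.0
    enough_batched_audio = buffered_audio_s >= 10.0

    if fast_link and not enough_batched_audio:
        return "cloud"
    if enough_batched_audio:
        return "gpu_batched"
    # Slow link and too little audio buffered: keep accumulating samples.
    return "buffer_more"


if __name__ == "__main__":
    # Example: a slow link with 15 s of buffered audio favors local GPU execution.
    print(choose_execution_target(NetworkState(throughput_mbps=8.0, rtt_ms=60.0), 15.0))
```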
