DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning

Microphones are remarkably powerful sensors of human behavior and context. However, audio sensing is highly susceptible to wild fluctuations in accuracy when used in the diverse acoustic environments (such as bedrooms, vehicles, or cafes) that users encounter on a daily basis. To address this challenge, we turn to the field of deep learning, an area of machine learning that has radically changed related audio modeling domains like speech recognition. In this paper, we present DeepEar, the first mobile audio sensing framework built from coupled Deep Neural Networks (DNNs) that simultaneously perform common audio sensing tasks. We train DeepEar with a large-scale dataset that includes unlabeled data from 168 place visits. The resulting learned model, comprising 2.3M parameters, enables DeepEar to significantly increase inference robustness to background noise beyond the conventional approaches present in mobile devices. Finally, we show that DeepEar is feasible for smartphones by building a cloud-free, DSP-based prototype that runs continuously while using only 6% of the smartphone's battery daily.
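To make the coupled-DNN idea concrete, the sketch below (in PyTorch) shows one plausible reading of the architecture the abstract describes: a stack of shared hidden layers feeding one small classification head per audio sensing task, so a single forward pass yields all inferences at once. The feature dimension, layer widths, and task set used here are illustrative assumptions, not the published 2.3M-parameter DeepEar configuration.

```python
# Minimal sketch (not the authors' code) of a coupled multi-task audio DNN:
# shared hidden layers feed several task-specific heads, so one forward pass
# serves all sensing tasks at once. All sizes/names below are assumptions.
import torch
import torch.nn as nn

class MultiTaskAudioDNN(nn.Module):
    def __init__(self, n_features=390, hidden=512, task_classes=None):
        super().__init__()
        # Hypothetical task set; the real framework targets several common
        # audio sensing tasks trained together.
        if task_classes is None:
            task_classes = {"ambient_scene": 19, "stress": 2,
                            "emotion": 5, "speaker_id": 23}
        # Shared trunk: in the paper's setting this portion can benefit from
        # unsupervised pre-training on unlabeled audio; here it is random.
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One lightweight classification head per sensing task.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_classes.items()}
        )

    def forward(self, x):
        h = self.trunk(x)  # shared representation computed once
        return {task: head(h) for task, head in self.heads.items()}

# Usage: one batch of (hypothetical) filter-bank feature frames yields
# logits for every task simultaneously.
model = MultiTaskAudioDNN()
frames = torch.randn(8, 390)  # 8 frames, 390-dim features (assumed)
logits = model(frames)
print({task: out.shape for task, out in logits.items()})
```

Sharing the trunk across tasks is what makes this style of design attractive for continuous, on-device operation: the expensive lower layers are computed once per audio frame regardless of how many inferences are requested, which fits the abstract's cloud-free DSP prototype and its small daily energy budget.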
