Audio Event Recognition in the Smart Home

After a brief overview of the relevance and value of deploying automatic audio event recognition (AER) in the smart home market, this chapter reviews three aspects of the productization of AER that are important to consider when developing pathways to impact between fundamental research and “real-world” applications. The first section shows that applications introduce a variety of practical constraints which open new research topics in the field: clarifying the definition of sound events, which motivates the explicit modeling of temporal patterns and interruption; running and evaluating AER in 24/7 sound detection setups, which suggests recasting the problem as open-set recognition; and running AER applications on consumer devices with limited audio quality and computational power, which raises questions of scalability and robustness. The second section explores the definition of user experience for AER. After reporting field observations about the ways in which system errors affect user experience, it proposes introducing opinion scoring into AER evaluation methodology. The link between standard AER performance metrics and subjective user experience metrics is then explored, and attention is drawn to the fact that F-score metrics conflate the objective evaluation of acoustic discrimination with the subjective choice of an application-dependent operating point. Solutions for separating discrimination from calibration in system evaluation are introduced, allowing acoustic modeling optimization to be separated more explicitly from the optimization of application-dependent user experience. Finally, the last section analyses the ethical and legal issues involved in deploying AER systems which are “listening” at all times in the users’ private space. A review of the key notions underpinning European data and privacy protection laws, questioning if and when these apply to audio data, suggests a set of guidelines which can be summarized as empowering users to consent by fully informing them about the use of their data, and taking reasonable information security measures to protect users’ personal data.
