RESONATE: Reverberation environment simulation for improved classification of speech models

Home monitoring systems currently gather information about people's activities of daily living and about emergencies, but they lack the ability to track speech. Practical speech analysis solutions are needed to help monitor ongoing conditions such as depression, because the amount of social interaction and vocal affect are important indicators of mood and well-being. Although existing solutions can classify the identity and mood of a speaker, they perform poorly when the acoustic signals are captured in reverberant environments. In this paper, we present a practical reverberation compensation method, RESONATE, which uses simulated room impulse responses to adapt a training corpus for use in multiple real reverberant rooms. We demonstrate that the resulting classifiers are robust, performing within 5-10% of the baseline accuracy achieved in non-reverberant environments. We evaluate this matched-condition strategy on a public dataset, in controlled experiments in six rooms, and in two long-term, uncontrolled real-world deployments. We offer a practical implementation that performs collection, feature extraction, and classification on-node, and performs training and simulation of training sets on a base station or cloud service.
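The core of the matched-condition strategy described above is to convolve clean training speech with a room impulse response (RIR) so the training data resembles reverberant test data. The sketch below illustrates the idea with a synthetic exponentially decaying noise RIR; this is a simplified stand-in for the image-method simulation the paper relies on, and the function names and parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def synthetic_rir(rt60=0.4, fs=16000, length_s=0.5, seed=0):
    """Illustrative RIR: exponentially decaying white noise.
    (A common simple reverberation model; RESONATE itself uses
    image-method simulation to generate room-specific RIRs.)"""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    t = np.arange(n) / fs
    # amplitude falls by 60 dB (factor 1000) over rt60 seconds
    decay = np.exp(-np.log(1000.0) * t / rt60)
    return rng.standard_normal(n) * decay

def reverberate(clean, rir):
    """Matched-condition augmentation: convolve clean speech with an RIR,
    then rescale so the augmented clip has the same peak level."""
    wet = np.convolve(clean, rir)[: len(clean)]
    return wet * (np.max(np.abs(clean)) / (np.max(np.abs(wet)) + 1e-12))

fs = 16000
# 1 s sine tone as a placeholder for a clean speech clip
clean = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
wet = reverberate(clean, synthetic_rir(rt60=0.4, fs=fs))
```

In a full pipeline, each clean training utterance would be convolved with RIRs simulated for the target rooms before feature extraction, so the classifier is trained under the same acoustic conditions it will see at deployment.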
