Real-Time Activity Detection in a Multi-Talker Reverberated Environment

This paper proposes a real-time person activity detection framework that operates in the presence of multiple sources in reverberated environments. The framework consists of two main parts: the speech enhancement front-end and the activity detector. The front-end automatically reduces the distortions that room reverberation introduces into the available distant speech signals, with the goal of significantly improving speech quality for each speaker. It is composed of three cooperating blocks, each fulfilling a specific task: speaker diarization, room impulse response identification, and speech dereverberation. In particular, the speaker diarization algorithm steers the operations of the other two stages according to the speakers’ activity in the room. The activity estimation algorithm is based on bidirectional Long Short-Term Memory networks, which allow context-sensitive activity classification from audio feature functionals extracted with the real-time speech feature extraction toolkit openSMILE. Extensive computer simulations were performed on a subset of the AMI database for activity evaluation in meetings; the results confirm the effectiveness of the approach.
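To make the notion of "feature functionals" concrete, the following minimal sketch summarises a sequence of frame-level values with per-window statistics, producing one vector per window that a sequence classifier could consume. It is an illustration only and does not use the openSMILE API: the function name `functionals`, the window and hop lengths, the choice of statistics (mean, standard deviation, minimum, maximum), and the toy `energy` values are all assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def functionals(frames, win=4, hop=2):
    """Summarise frame-level values with per-window statistics.

    `win` and `hop` are counted in frames; both defaults are
    hypothetical values chosen for this illustration.
    """
    vectors = []
    # Slide a fixed-length window over the frame sequence.
    for start in range(0, len(frames) - win + 1, hop):
        chunk = frames[start:start + win]
        # One functional vector per window: (mean, std, min, max).
        vectors.append((mean(chunk), pstdev(chunk), min(chunk), max(chunk)))
    return vectors


# Toy frame-level descriptor (e.g. a per-frame energy value); made-up data.
energy = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
vectors = functionals(energy)

for vec in vectors:
    print(vec)
```

Each window yields a fixed-length vector regardless of what happens inside it, so the classifier sees a regular sequence in which context across neighbouring windows can be modelled.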
