Multi-speaker voice activity detection using a camera-assisted microphone array

We present a method for voice activity detection of multiple concurrent speakers using a camera-assisted microphone array. The method uses face detection to locate potential speech sources, then uses these locations to steer an adaptive beamformer toward each one, so that voice activity can be detected for each speaker individually. Voice activity is classified using support vector machines with mel-frequency cepstral coefficients as features. To improve the spatial filtering of the array, we combine Dolph-Chebyshev weighting with null-steering. In two experiments gauging the accuracy of the proposed method, we obtained a representative accuracy of around 95% for single speakers, with a loss of around 1% in accuracy for two simultaneous speakers.
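The null-steering idea mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a narrowband, far-field, uniform linear array, and all function names and parameters (`steering_vector`, `null_steered_weights`, `spacing_wavelengths`, and so on) are hypothetical. The Dolph-Chebyshev weights are not computed here; they would be supplied through the `taper` argument, which defaults to uniform weighting.

```python
import cmath
import math


def steering_vector(n_mics, spacing_wavelengths, angle_deg):
    """Far-field steering vector of a uniform linear array.

    The angle is measured from broadside; spacing is in wavelengths.
    """
    phase = 2.0 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase * m) for m in range(n_mics)]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def null_steered_weights(n_mics, spacing_wavelengths, look_deg, null_deg, taper=None):
    """Tapered beam toward look_deg with a null forced at null_deg.

    The tapered weights are projected onto the subspace orthogonal to the
    steering vector of the unwanted direction (an assumed, simple form of
    null-steering with a single null).
    """
    taper = taper or [1.0] * n_mics  # assumption: uniform taper by default
    a_look = steering_vector(n_mics, spacing_wavelengths, look_deg)
    a_null = steering_vector(n_mics, spacing_wavelengths, null_deg)
    w_desired = [t * a / n_mics for t, a in zip(taper, a_look)]
    scale = inner(a_null, w_desired) / inner(a_null, a_null)
    return [w - scale * a for w, a in zip(w_desired, a_null)]


def response(weights, spacing_wavelengths, angle_deg):
    """Magnitude of the array response w^H a(angle)."""
    a = steering_vector(len(weights), spacing_wavelengths, angle_deg)
    return abs(inner(weights, a))


# Example: 8 microphones at half-wavelength spacing, one speaker at
# broadside (0 degrees) and a second, interfering speaker at 20 degrees.
weights = null_steered_weights(8, 0.5, look_deg=0.0, null_deg=20.0)
gain_look = response(weights, 0.5, 0.0)
gain_null = response(weights, 0.5, 20.0)
```

In this sketch the gain toward the interfering direction is driven to numerical zero while the gain toward the look direction stays close to one. In a setup like the one described, the look and null directions would presumably come from the detected face positions, with one set of weights per speaker.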
