Paper information - Multi-speaker voice activity detection using a camera-assisted microphone array

We present a method for voice activity detection of multiple concurrent speakers using a camera-assisted microphone array. The proposed method uses face detection to identify locations of potential speech sources, and uses this information in an adaptive beamforming procedure to form a spatially directed detection algorithm to identify voice activity for individual speakers. Voice activity is classified using support vector machines with mel-frequency cepstrum coefficients as features. To increase the spatial filtering ability of the array we use a combination of Dolph-Chebyshev weighting and null-steering. We have carried out two experiments to gauge the accuracy of the proposed method, and obtain a representative accuracy of around 95% for single speakers, with around 1% loss of accuracy with two simultaneous speakers.
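The null-steering idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a narrowband signal, a uniform linear array with half-wavelength spacing, and speaker directions already obtained as angles (for example from face detection). It omits the Dolph-Chebyshev weighting and the SVM classifier. All function names and parameter values here are hypothetical.

```python
import cmath
import math


def steering_vector(angle_deg, num_mics, spacing_wavelengths=0.5):
    """Narrowband steering vector of a uniform linear array.

    The angle is measured from broadside; the spacing is in wavelengths
    (0.5 is an assumed value, not taken from the paper).
    """
    phase_step = 2.0 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase_step * m) for m in range(num_mics)]


def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def null_steering_weights(target_deg, interferer_deg, num_mics):
    """Weights with unit gain toward the target and a null toward the interferer.

    Assumes the two directions are distinct; identical directions would
    make the normalisation below divide by zero.
    """
    a_t = steering_vector(target_deg, num_mics)
    a_i = steering_vector(interferer_deg, num_mics)
    # Remove from a_t its component along a_i, so that w^H a_i = 0.
    coeff = inner(a_i, a_t) / inner(a_i, a_i)
    w = [t - coeff * i for t, i in zip(a_t, a_i)]
    # Normalise so that the response toward the target is exactly 1.
    gain = inner(w, a_t)
    return [x / gain.conjugate() for x in w]


def response(weights, angle_deg):
    """Magnitude of the array response w^H a(angle)."""
    return abs(inner(weights, steering_vector(angle_deg, len(weights))))


if __name__ == "__main__":
    # Hypothetical scene: target speaker at 20 degrees, competing speaker at -35 degrees.
    w = null_steering_weights(20.0, -35.0, num_mics=8)
    print("gain toward target:    ", response(w, 20.0))
    print("gain toward interferer:", response(w, -35.0))
```

With two detected faces, one set of weights would be computed per speaker, each treating the other speaker as the interferer, so that voice activity can be assessed on each beamformed output separately.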

Sverre Holm | Ines Hafizovic | Trond F. Bergh
