Paper information - [Paper] DLF-based Speech Segment Detection and Its Application to Audio Noise Removal for Video Conferences

This paper presents a new decision-level fusion (DLF)-based speech segment detection method and its application to audio noise removal for video conferences. The proposed method computes visual features from the video sequences and audio features from the audio signals captured in a video conference. The visual features are extracted from the mouth regions of participants, and the audio features are attribution degrees of the speech class. A Support Vector Machine (SVM) classifier is applied to each kind of feature and performs two-class classification of speech and non-speech segments to realize speech segment detection. The detection results obtained from the visual and audio features are then combined by DLF based on Supervised Learning from Multiple Experts, which takes the accuracy of each detection result into account when producing the final detection results. Finally, noise information is extracted from the audio signals in the non-speech segments detected by the method and used to realize accurate audio noise removal in the speech segments.
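The pipeline described in the abstract (per-modality detection, decision-level fusion, noise estimation from non-speech segments, noise removal in speech segments) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the accuracy-weighted vote below is a simple stand-in for Supervised Learning from Multiple Experts, and magnitude spectral subtraction is assumed as the noise removal step. All function names, the accuracy values, and the toy spectra are hypothetical.

```python
import math


def log_odds(accuracy):
    """Weight of a detector: log-odds of its (assumed) validation accuracy."""
    eps = 1e-6
    a = min(max(accuracy, eps), 1 - eps)
    return math.log(a / (1 - a))


def fuse_decisions(visual, audio, visual_acc, audio_acc):
    """Fuse per-frame labels (+1 speech, -1 non-speech) by accuracy-weighted voting.

    This is a simplified stand-in for the DLF step, not the method of the paper.
    """
    wv, wa = log_odds(visual_acc), log_odds(audio_acc)
    fused = []
    for v, a in zip(visual, audio):
        score = wv * v + wa * a
        fused.append(1 if score > 0 else -1)
    return fused


def estimate_noise(spectra, labels):
    """Average the magnitude spectra of the frames labelled non-speech."""
    noise_frames = [s for s, lab in zip(spectra, labels) if lab == -1]
    if not noise_frames:
        return [0.0] * len(spectra[0])
    n = len(noise_frames)
    return [sum(column) / n for column in zip(*noise_frames)]


def subtract_noise(spectra, labels, noise):
    """Subtract the noise estimate from speech frames, flooring at zero.

    Non-speech frames are returned unchanged.
    """
    cleaned = []
    for s, lab in zip(spectra, labels):
        if lab == 1:
            cleaned.append([max(m - n, 0.0) for m, n in zip(s, noise)])
        else:
            cleaned.append(list(s))
    return cleaned


# Hypothetical per-frame outputs of the visual and audio SVM detectors.
visual = [1, 1, -1, -1]
audio = [1, -1, 1, -1]
# Hypothetical magnitude spectra, one row per frame, two frequency bins.
spectra = [[5.0, 4.0], [1.0, 2.0], [6.0, 3.0], [1.0, 0.0]]

fused = fuse_decisions(visual, audio, visual_acc=0.7, audio_acc=0.9)
noise = estimate_noise(spectra, fused)
cleaned = subtract_noise(spectra, fused, noise)
```

When the two detectors disagree, the vote follows the one with the higher assumed accuracy, which mirrors the idea of weighting each detection result by how reliable it is.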

Miki Haseyama | Takahiro Ogawa | Sho Takahashi | Kazuto Sasaki
