Paper information - Multisensory speech enhancement in noisy environments using bone-conducted and air-conducted microphones


In this paper, we propose a speech enhancement algorithm that estimates clean speech from samples of air-conducted and bone-conducted speech signals. We introduce a model in a supervised learning framework that approximates a mapping from the concatenation of noisy air-conducted and bone-conducted speech to clean speech in the short-time Fourier transform domain. Two function extension schemes are utilized: geometric harmonics and the Laplacian pyramid. The performance of the two schemes is evaluated and compared in terms of spectrograms and log-spectral distance measures.
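As a rough illustration of the ideas named in the abstract, the following Python sketch shows one common form of Laplacian pyramid function extension and one common definition of the log-spectral distance. It is not the authors' implementation. The function names, the starting kernel width `sigma0`, the number of levels, the halving of the kernel width per level, and the synthetic toy data are all assumptions made for illustration. In the paper's setting, each row of `x_train` would presumably hold the concatenated noisy air-conducted and bone-conducted spectral features of one frame, and the matching row of `f_train` the clean spectrum.

```python
import numpy as np

def _row_normalized_kernel(a, b, sigma):
    # Gaussian affinities between rows of a and rows of b; each row sums to 1.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    # Shifting by the row minimum cancels in the normalization and avoids
    # rows that underflow to all zeros at small sigma.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    w = np.exp(-d2 / sigma**2)
    return w / w.sum(axis=1, keepdims=True)

def laplacian_pyramid_extend(x_train, f_train, x_new, sigma0=4.0, levels=6):
    # Multiscale extension: each level smooths the residual left by the
    # coarser levels, using a kernel half as wide as the previous one.
    approx_train = np.zeros_like(f_train, dtype=float)
    approx_new = np.zeros((x_new.shape[0], f_train.shape[1]))
    for level in range(levels):
        sigma = sigma0 / 2**level
        residual = f_train - approx_train
        approx_new += _row_normalized_kernel(x_new, x_train, sigma) @ residual
        approx_train += _row_normalized_kernel(x_train, x_train, sigma) @ residual
    return approx_new

def log_spectral_distance(ref_mag, est_mag, eps=1e-10):
    # Inputs: magnitude spectra of shape (frames, frequency_bins).
    # RMS difference of the dB power spectra over frequency, averaged over frames.
    ref_db = 10.0 * np.log10(np.maximum(ref_mag**2, eps))
    est_db = 10.0 * np.log10(np.maximum(est_mag**2, eps))
    return float(np.mean(np.sqrt(np.mean((ref_db - est_db) ** 2, axis=1))))

# Toy data (hypothetical): a smooth function sampled on random 2-D points.
rng = np.random.default_rng(0)
x_toy = rng.uniform(size=(100, 2))
f_toy = np.sin(2.0 * np.pi * x_toy[:, :1]) + x_toy[:, 1:]

coarse = laplacian_pyramid_extend(x_toy, f_toy, x_toy, levels=1)
fine = laplacian_pyramid_extend(x_toy, f_toy, x_toy, levels=10)
err_coarse = float(np.sqrt(np.mean((coarse - f_toy) ** 2)))
err_fine = float(np.sqrt(np.mean((fine - f_toy) ** 2)))
```

In this toy setup the single-level fit is close to a global average, while adding finer levels progressively reduces the training residual. A practical system would stop adding levels once the residual falls below a chosen tolerance, to avoid fitting noise.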
