Multisensory speech enhancement in noisy environments using bone-conducted and air-conducted microphones

In this paper, we propose a speech enhancement algorithm that estimates clean speech from samples of air-conducted and bone-conducted speech signals. Within a supervised learning framework, we approximate a mapping from the concatenation of noisy air-conducted and bone-conducted speech to clean speech in the short-time Fourier transform (STFT) domain. Two function extension schemes are used: geometric harmonics and the Laplacian pyramid. The performance of the two schemes is evaluated and compared using spectrograms and log-spectral distance measures.
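
As a rough illustration of the function extension idea, the following minimal sketch extends a mapping learned on training features to new points with a Laplacian pyramid, and scores an estimate with a log-spectral distance. It is not the paper's implementation: the Gaussian kernel, the scale halving schedule, the parameter values, the function names, and the particular log-spectral distance formula (mean over frames of the RMS difference of log-magnitude spectra in dB) are all assumptions made for illustration. In the setting described above, each row of `train_x` would hold the concatenated noisy air-conducted and bone-conducted STFT features of one frame, and each row of `train_y` the corresponding clean-speech spectrum.

```python
import numpy as np


def _smoother(queries, train, sigma):
    """Row-normalized Gaussian kernel between query points and training points."""
    d2 = ((queries[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / sigma**2)
    # Guard against rows that underflow to zero at very fine scales.
    return w / np.maximum(w.sum(axis=1, keepdims=True), 1e-300)


def laplacian_pyramid_extend(train_x, train_y, new_x, sigma0=1.0, levels=10):
    """Extend the mapping train_x -> train_y to new_x (hypothetical helper).

    At each level the current residual on the training set is smoothed with a
    Gaussian kernel whose scale is halved, and the same smoothing is applied
    to the new points. The per-level contributions are summed.
    """
    residual = np.asarray(train_y, dtype=float).copy()
    estimate = np.zeros((new_x.shape[0], residual.shape[1]))
    for level in range(levels):
        sigma = sigma0 / 2**level
        approx_train = _smoother(train_x, train_x, sigma) @ residual
        estimate += _smoother(new_x, train_x, sigma) @ residual
        residual -= approx_train  # what the coarser levels failed to explain
    return estimate


def log_spectral_distance(clean_mag, est_mag, eps=1e-10):
    """Mean over frames of the RMS log-magnitude difference in dB.

    Both inputs are (frames, frequency_bins) magnitude spectrograms.
    """
    diff = 20 * np.log10(clean_mag + eps) - 20 * np.log10(est_mag + eps)
    return float(np.mean(np.sqrt(np.mean(diff**2, axis=1))))


if __name__ == "__main__":
    # Toy data standing in for (noisy features, clean spectrum) pairs.
    x = np.linspace(0.0, 1.0, 20)[:, None]
    y = np.sin(2 * np.pi * x)
    x_new = (x[:-1] + x[1:]) / 2
    y_new = laplacian_pyramid_extend(x, y, x_new)
    print("max error at new points:", np.abs(y_new - np.sin(2 * np.pi * x_new)).max())
```

The number of levels controls how closely the training targets are reproduced: with more levels the kernel becomes nearly diagonal on the training set, so in practice the pyramid would be stopped once the residual is small enough, to avoid fitting noise.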
