Multimodal speech recognition with ultrasonic sensors

In this research we explore multimodal speech recognition by augmenting acoustic information with measurements obtained from an ultrasonic emitter and receiver. After designing a hardware component that generates a combined stereo audio/ultrasound signal, we extract sub-band ultrasonic features that supplement conventional MFCC-based audio measurements. A simple interpolation method is used to combine the audio and ultrasound model likelihoods. Experiments on a noisy continuous digit recognition task indicate that the addition of ultrasonic information reduces word error rates by 24-29% over a wide range of acoustic SNRs (20 dB down to 0 dB).

Index Terms: multimodal, ultrasonic speech recognition
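As a rough illustration of the likelihood-combination step mentioned above, the sketch below interpolates per-hypothesis log-likelihoods from an audio model and an ultrasound model. The function name, the fixed stream weight, and the example scores are illustrative assumptions only; the paper does not specify its exact weighting or tuning procedure.

```python
import numpy as np

def combine_log_likelihoods(audio_loglik, ultra_loglik, audio_weight=0.7):
    """Log-linear interpolation of per-hypothesis model log-likelihoods.

    audio_loglik, ultra_loglik: arrays of log-likelihoods, one entry per
    candidate hypothesis (e.g. per digit model).
    audio_weight: interpolation weight on the audio stream (hypothetical
    value; in practice it would be tuned, e.g. per SNR condition).
    """
    audio = np.asarray(audio_loglik, dtype=float)
    ultra = np.asarray(ultra_loglik, dtype=float)
    return audio_weight * audio + (1.0 - audio_weight) * ultra

# Usage: pick the hypothesis with the highest combined score.
audio_ll = np.array([-120.4, -118.9, -131.2])  # example audio-model scores
ultra_ll = np.array([-88.1, -95.3, -90.7])     # example ultrasound-model scores
best_hypothesis = int(np.argmax(combine_log_likelihoods(audio_ll, ultra_ll)))
```

With a fixed weight this reduces to a simple weighted sum in the log domain; a natural extension, as in multi-stream audio-visual systems, is to let the weight vary with the estimated acoustic SNR.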
