Robust Speech Recognition System for Communication Robots in Real Environments

The application range of communication robots could be greatly expanded by the use of an automatic speech recognition (ASR) system with improved robustness to noise and to speakers of different ages. In this paper, we describe an ASR system that can robustly recognize both adults' and children's speech in noisy environments, and we evaluate it on a communication robot placed in a real noisy environment. Speech is captured with a twelve-element microphone array mounted in the robot's chest. To suppress interference and noise and to attenuate reverberation, we implemented a multi-channel front end consisting of an outlier-robust generalized sidelobe canceller (RGSC) and feature-space noise suppression based on the minimum mean-square error (MMSE) criterion. Speech activity periods are detected by Gaussian mixture model (GMM)-based end-point detection (GMM-EPD). Our ASR system runs two decoders, one for adults' and one for children's speech, and the final hypothesis is selected based on posterior probability. A generalized word posterior probability (GWPP)-based confidence measure is then assigned to this hypothesis, and if it exceeds a threshold, the hypothesis is passed to a subsequent dialog processing module. The performance of each step was evaluated for adults' and children's speech by adding real environmental noise, recorded in a cafeteria, at different levels. Experimental results indicated that our ASR system achieves over 80% word accuracy under 70 dBA noise. A further evaluation on adult speech recorded in a real noisy environment yielded 73% word accuracy.
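
To make the final decision stage concrete, the following is a minimal Python sketch of the parallel-decoder selection and confidence gating described above. Everything in it (the DecoderResult container, select_hypothesis, the threshold value, and the use of an averaged per-word posterior as a stand-in for the GWPP-based confidence) is an illustrative assumption rather than the authors' implementation; a real system would compute GWPP from the decoder's word graph.

```python
# Illustrative sketch (not the authors' implementation) of the decision stage:
# two decoders (adult / child acoustic models) each produce a hypothesis, the
# one with the higher posterior score is kept, and a confidence gate decides
# whether to forward it to the dialog processing module.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DecoderResult:                      # hypothetical container
    words: List[str]                      # recognized word sequence
    posterior: float                      # utterance-level posterior score
    word_posteriors: List[float]          # per-word posteriors (stand-in for GWPP)


CONFIDENCE_THRESHOLD = 0.7                # illustrative value, not from the paper


def utterance_confidence(result: DecoderResult) -> float:
    """Stand-in for a GWPP-based confidence: average of the per-word posteriors."""
    if not result.word_posteriors:
        return 0.0
    return sum(result.word_posteriors) / len(result.word_posteriors)


def select_hypothesis(adult: DecoderResult, child: DecoderResult) -> Optional[List[str]]:
    """Pick the decoder output with the higher posterior, then gate it by confidence."""
    best = adult if adult.posterior >= child.posterior else child
    if utterance_confidence(best) >= CONFIDENCE_THRESHOLD:
        return best.words                 # accepted: forwarded to the dialog module
    return None                           # rejected: e.g., ask the user to repeat


if __name__ == "__main__":
    adult_hyp = DecoderResult(["hello", "robot"], posterior=0.82,
                              word_posteriors=[0.9, 0.8])
    child_hyp = DecoderResult(["hello", "robots"], posterior=0.64,
                              word_posteriors=[0.7, 0.5])
    print(select_hypothesis(adult_hyp, child_hyp))   # -> ['hello', 'robot']
```

Under these assumptions, low-confidence hypotheses are simply rejected rather than forwarded to the dialog module, which is the gating behavior the abstract describes.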
