RoboASR: A Dynamic Speech Recognition System for Service Robots

This paper proposes a new method for building dynamic speech decoding graphs for state-based spoken human-robot interaction (HRI). Current robotic speech recognition systems are based on either a finite state grammar (FSG), a statistical N-gram model, or a combination of both using multi-pass decoding. The proposed method merges the FSG and N-gram models into a single decoding graph by converting the FSG rules into a weighted finite state acceptor (WFSA) and then composing it with a large N-gram based weighted finite state transducer (WFST). The result is a tiny decoding graph that can be used in single-pass decoding. The method is applied in our speech recognition system (RoboASR) for controlling service robots with limited resources. The proposed approach has three advantages. First, it combines the strengths of both FSG and N-gram decoders by composing them into a single tiny decoding graph. Second, it is robust: the resulting decoding graph is highly accurate because it is tailored to the current HRI state. Third, it has a fast response time in comparison to current state-of-the-art speech recognition systems. The system has a large vocabulary of 64K words with more than 69K entries. Experimental results show that the average response time is 0.05% of the utterance length and the average ratio between true and false positives is 89% when tested on 15 interaction scenarios using live speech.
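The core operation described above, composing a small state-specific grammar with a large language model so that only paths accepted by both survive, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: both machines are modeled as deterministic weighted acceptors over words, weights are negative log probabilities combined in the tropical semiring (add along a path, intersect by composition), and the toy FSG, N-gram fragment, and all weights are invented for the example.

```python
def compose(a, b):
    """Compose two deterministic weighted acceptors.

    Each acceptor is a triple (start, finals, arcs), where arcs maps
    state -> {label: (next_state, weight)}.  The result accepts exactly
    the label sequences accepted by both inputs, with arc weights added
    (tropical semiring), mirroring the FSG x N-gram composition step.
    """
    start = (a[0], b[0])
    finals = set()
    arcs = {}
    stack, seen = [start], {start}
    while stack:
        sa, sb = stack.pop()
        if sa in a[1] and sb in b[1]:
            finals.add((sa, sb))
        out = {}
        for label, (na, wa) in a[2].get(sa, {}).items():
            # Keep an arc only if both machines accept this word here.
            if label in b[2].get(sb, {}):
                nb, wb = b[2][sb][label]
                out[label] = ((na, nb), wa + wb)
                if (na, nb) not in seen:
                    seen.add((na, nb))
                    stack.append((na, nb))
        arcs[(sa, sb)] = out
    return (start, finals, arcs)

# Hypothetical FSG for one HRI state: the commands "go kitchen" / "go lab".
fsg = (0, {2}, {
    0: {"go": (1, 0.0)},
    1: {"kitchen": (2, 0.0), "lab": (2, 0.0)},
})
# Toy unigram fragment standing in for the large N-gram model
# (weights are made-up negative log probabilities).
ngram = (0, {0}, {
    0: {"go": (0, 0.5), "kitchen": (0, 1.2),
        "lab": (0, 2.3), "stop": (0, 0.7)},
})

graph = compose(fsg, ngram)
```

In the composed `graph`, out-of-grammar words such as `"stop"` are pruned even though the language model allows them, while the surviving FSG paths carry the N-gram costs, which is why the resulting state-specific graph stays small and accurate.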
