A Real-Time Speech Command Detector for a Smart Control Room

In this work, we present an always-on speech recognition system that discriminates spoken commands directed at the system from all other spoken input. For this discrimination, we integrate a variety of features, ranging from prosodic cues and decoding features to linguistic information. The resulting "Speech Command Detector" provides intuitive, hands-free user interaction in a Smart Control Room environment, where voice commands are directed toward a large interactive display. With a recognition vocabulary of 259 words covering more than 10k possible commands, the Speech Command Detector detected 88.3% of the commands correctly while maintaining a very low False Positive Rate of 1.5%. In a cross-domain setup, the system was evaluated on a Star Trek episode; with only minor adjustments, it achieved very promising results, with a 91.2% command detection rate at a False Positive Rate of 1.8%.

Index Terms: always-on spoken command detection, smart environment, prosodic and confidence-based features
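To make the feature-fusion idea concrete, the sketch below shows one plausible way to combine prosodic, decoding, and linguistic cues into a single command/non-command score. It is a minimal, hypothetical illustration, not the authors' implementation: the specific feature names (f0_mean, asr_confidence, lm_in_grammar, etc.) and the choice of a small logistic-regression scorer are assumptions made for the example.

```python
# Hypothetical sketch of feature fusion for spoken command detection.
# The feature set and the logistic-regression scorer are illustrative
# assumptions; the paper's actual features and classifier may differ.

from dataclasses import dataclass
import numpy as np


@dataclass
class Utterance:
    f0_mean: float         # mean fundamental frequency (prosodic cue, assumed)
    f0_range: float        # pitch range within the utterance (assumed)
    energy_mean: float     # mean frame energy (assumed)
    asr_confidence: float  # average decoder word confidence (assumed)
    lm_in_grammar: float   # 1.0 if the hypothesis parses in the command grammar


def features(u: Utterance) -> np.ndarray:
    """Fuse prosodic, decoding, and linguistic cues into one vector."""
    return np.array([u.f0_mean, u.f0_range, u.energy_mean,
                     u.asr_confidence, u.lm_in_grammar])


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


class CommandDetector:
    """Tiny logistic-regression detector trained with gradient descent."""

    def __init__(self, lr: float = 0.5, epochs: int = 500):
        self.lr, self.epochs = lr, epochs

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # Standardize features so prosodic values (e.g. Hz) and
        # confidence scores (0..1) contribute on comparable scales.
        self.mu, self.sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
        Xn = (X - self.mu) / self.sigma
        self.w, self.b = np.zeros(X.shape[1]), 0.0
        for _ in range(self.epochs):
            p = sigmoid(Xn @ self.w + self.b)
            g = p - y                            # gradient of the log loss
            self.w -= self.lr * Xn.T @ g / len(y)
            self.b -= self.lr * g.mean()

    def is_command(self, u: Utterance, threshold: float = 0.5) -> bool:
        # The threshold trades detection rate against False Positive Rate.
        x = (features(u) - self.mu) / self.sigma
        return sigmoid(x @ self.w + self.b) >= threshold


if __name__ == "__main__":
    # Toy data: one command-like and one conversational utterance.
    X = np.array([[190.0, 80.0, 0.7, 0.9, 1.0],
                  [150.0, 30.0, 0.4, 0.5, 0.0]])
    y = np.array([1.0, 0.0])
    det = CommandDetector()
    det.fit(X, y)
    print(det.is_command(Utterance(185.0, 75.0, 0.65, 0.85, 1.0)))
```

In a deployed always-on system, the decision threshold would be tuned on held-out data to keep the False Positive Rate low, since spurious activations are far more disruptive than occasional missed commands.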
