Expectation-Based Command Recognition Off the Shelf: Publicly Reproducible Experiments with Speech Input

When striving for a cheap implementation of command recognition for speech input today, you may resort to off-the-shelf tools. In contrast to specific research approaches, such tools by themselves do not take into account expectations for certain commands in a given situation. Such expectations will usually be available both in "intelligent" and in more conventional programs using such a speech interface, and they should be used to improve command recognition. We propose to make use of a set of expected commands at a given state of a dialogue and a list of ranked command hypotheses from basic speech recognition. We devised and implemented this with two approaches for speech input. One specializes a given grammar according to the expected commands at each dialogue state; the other accepts the highest-ranked hypothesis for a command that fits the expected ones at the given state. In an experiment, the latter approach achieved a statistically significant improvement of the command success rate, as compared to ignoring the expectations. Since all tools and materials are freely available, we have made these experiments publicly reproducible.
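As a rough illustration of the second approach, the following is a minimal Python sketch, not the paper's actual implementation: it walks the recognizer's ranked n-best list and accepts the highest-ranked hypothesis that matches one of the commands expected at the current dialogue state. All names and the toy data are hypothetical, and the fallback behavior when no hypothesis fits is an assumption, since the abstract does not specify one.

```python
from typing import Iterable, Optional


def select_command(nbest: Iterable[str],
                   expected: set[str]) -> Optional[str]:
    """Return the highest-ranked hypothesis that is an expected command.

    `nbest` is the recognizer's n-best list, ordered best-first;
    `expected` is the set of commands expected at the current dialogue
    state. Returns None if no hypothesis fits (this rejection fallback
    is an assumption; the abstract does not specify one).
    """
    for hypothesis in nbest:
        if hypothesis in expected:
            return hypothesis
    return None


# Hypothetical example: at a confirmation state only "yes", "no",
# and "repeat" are expected, so the mis-recognized top hypothesis
# "oh" is skipped in favor of the expected command "no".
if __name__ == "__main__":
    nbest = ["oh", "no", "go"]          # toy ranked hypotheses
    expected = {"yes", "no", "repeat"}  # commands expected at this state
    print(select_command(nbest, expected))  # -> "no"
```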
