A flat direct model for speech recognition

We introduce a direct model for speech recognition that assumes an unstructured, i.e., flat, text output. The flat model allows us to model arbitrary attributes and dependencies of the output. This differs from the HMMs typically used for speech recognition: the conventional HMM approach is built around sequential data and makes rigid assumptions about dependencies. HMMs have proven convenient and appropriate for large-vocabulary continuous speech recognition. The task under consideration here, however, is Windows Live Search for Mobile (WLS4M) [9], a cellphone application that lets users interact with web-based information portals. In this task, the set of valid outputs can be considered discrete and finite (although probably large, i.e., unseen events are an issue). Hence, a flat direct model lends itself to the task, making it straightforward and cheap to add different knowledge sources and dependencies. Using, e.g., HMM posterior, m-gram, and spotter features, we observed significant improvements over the conventional HMM system.
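A flat direct model of this kind can be read as a log-linear (maximum-entropy) model over entire candidate outputs: each candidate is scored by a weighted sum of feature functions, and the posterior is normalized over the finite output set. The sketch below is a minimal illustration under that reading; the feature functions and weights are toy stand-ins of our own, not the actual HMM posterior, m-gram, or spotter features used in the paper.

```python
import math

# Hypothetical feature functions f_i(x, w): each maps an (utterance, candidate
# output) pair to a real value. In the paper's setting these would be HMM
# posterior, m-gram, and spotter features; here we use toy stand-ins that
# operate on text only.
def length_feature(x, w):
    # Number of words in the candidate output.
    return float(len(w.split()))

def overlap_feature(x, w):
    # Toy "spotter" stand-in: fraction of candidate words found in the input.
    x_tokens = set(x.split())
    w_tokens = w.split()
    return sum(1.0 for t in w_tokens if t in x_tokens) / max(len(w_tokens), 1)

FEATURES = [length_feature, overlap_feature]

def score(x, w, lambdas):
    # Linear score: sum_i lambda_i * f_i(x, w).
    return sum(l * f(x, w) for l, f in zip(lambdas, FEATURES))

def posterior(x, candidates, lambdas):
    # Flat direct model: p(w|x) = exp(score(x, w)) / Z(x), where Z(x)
    # normalizes over the finite set of valid outputs (no Markov structure).
    scores = [score(x, w, lambdas) for w in candidates]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {w: e / z for w, e in zip(candidates, exps)}

# Example: rank a small, finite set of candidate listings for one query.
probs = posterior("pizza house seattle",
                  ["pizza house", "pizza hut", "sea house"],
                  [0.1, 2.0])
best = max(probs, key=probs.get)
```

Because the output is treated as a single flat unit, adding a new knowledge source is just appending another feature function and weight; no sequential decoding structure has to be revised.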

[1] B. Roark et al., "Discriminative language modeling with conditional random fields and the perceptron algorithm," ACL, 2004.

[2] G. Zweig et al., "Language modeling for voice search: a machine translation approach," IEEE ICASSP, 2008.

[3] M. A. Riedmiller et al., "A direct adaptive method for faster backpropagation learning: the RPROP algorithm," IEEE International Conference on Neural Networks, 1993.

[4] R. Rosenfeld et al., "A maximum entropy approach to adaptive statistical language modelling," Computer Speech and Language, 1996.

[5] A. Acero et al., "Hidden conditional random fields for phone classification," INTERSPEECH, 2005.

[6] P. Wambacq et al., "Template-based continuous speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2007.

[7] S. Axelrod et al., "Combination of hidden Markov models with dynamic time warping for speech recognition," IEEE ICASSP, 2004.

[8] C.-H. Lee et al., "A study on word detector design and knowledge-based pruning and rescoring," INTERSPEECH, 2007.

[9] G. Zweig et al., "Live search for mobile: web services by voice on the cellphone," IEEE ICASSP, 2008.