Efficient client-server based implementations of mobile speech recognition services

Abstract: This paper demonstrates the efficiencies that can be achieved when automatic speech recognition (ASR) applications are provided to large user populations through client–server implementations of interactive voice services. It shows that, with proper design of the client–server framework, excellent overall system performance can be obtained with minimal demands on the computing resources allocated to ASR. System performance is considered in terms of both ASR speed and accuracy in multi-user scenarios. An ASR resource allocation strategy is presented that keeps the average speech recognition response latency observed by users below one second even when the number of concurrent users exceeds the number of available ASR servers by more than an order of magnitude. An architecture for unsupervised estimation of user-specific feature space adaptation and normalization algorithms is also described and evaluated. Significant reductions in ASR word error rate were obtained by applying these techniques to utterances collected from users of hand-held mobile devices. These results are important because, although a large body of work addresses the speed and accuracy of individual ASR decoders, very little effort has been devoted to the same issues when a large number of ASR decoders are used in multi-user scenarios.
