Efficient scalable encoding for distributed speech recognition

This paper addresses the problem of encoding speech features for a distributed speech recognition (DSR) system. Specifically, speech features are compressed using scalable encoding techniques to provide a multi-resolution bitstream. The use of this scalable encoding procedure is investigated in conjunction with a multi-pass DSR system. The multi-pass DSR system aims at progressive refinement of recognition performance (i.e., as additional bits are transmitted, the recognition result can be refined) and is shown to reduce both bandwidth and complexity (latency). The proposed encoding schemes are well suited for implementation on lightweight mobile devices, where varying ambient conditions and limited computational capabilities severely constrain recognition performance. The multi-pass DSR system can adapt to varying network and system constraints by operating at an appropriate trade-off point between transmission rate, recognition performance, and complexity, so as to provide the desired quality of service (QoS) to the user. The system was tested in two case studies. In the first, a distributed two-stage names recognition task, the scalable encoder operating at a bitrate of 4.6 kb/s achieved the same performance as uncompressed features. In the second, a two-stage multi-pass continuous speech recognition task using HUB-4 data, the scalable encoder at a bitrate of 5.7 kb/s achieved the same performance as uncompressed features. Reducing the bitrate to 4.8 kb/s resulted in a 1% relative increase in word error rate (WER).
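The idea of a multi-resolution bitstream that supports progressive refinement can be illustrated with a small two-layer quantizer. The sketch below is not the encoder described in the paper; it is a minimal example of layered (successive-refinement) scalar quantization, and all names and parameters (`encode_scalable`, `decode`, `base_step`, `refine_levels`) are hypothetical choices made for illustration.

```python
def quantize(x, step):
    """Uniform scalar quantizer: index of the nearest reconstruction level."""
    return round(x / step)


def encode_scalable(features, base_step=1.0, refine_levels=4):
    """Split each feature value into a coarse base-layer index and an
    enhancement-layer index that quantizes the base-layer residual.

    Assumption: a plain uniform quantizer per layer, chosen only to keep
    the example short."""
    enh_step = base_step / refine_levels
    base, enh = [], []
    for x in features:
        b = quantize(x, base_step)
        residual = x - b * base_step
        base.append(b)
        enh.append(quantize(residual, enh_step))
    return base, enh


def decode(base, enh=None, base_step=1.0, refine_levels=4):
    """Reconstruct from the base layer alone (first pass), or from both
    layers once the enhancement bits have arrived (refined pass)."""
    if enh is None:
        return [b * base_step for b in base]
    enh_step = base_step / refine_levels
    return [b * base_step + e * enh_step for b, e in zip(base, enh)]


def max_abs_error(original, reconstructed):
    """Largest absolute reconstruction error over all values."""
    return max(abs(o - r) for o, r in zip(original, reconstructed))


# Hypothetical feature values standing in for one frame of speech features.
features = [0.3, -1.2, 2.65, 0.9]
base, enh = encode_scalable(features)
coarse = decode(base)        # what a first recognition pass would see
fine = decode(base, enh)     # what a refinement pass would see
```

A receiver holding only the base layer can already run a first recognition pass on `coarse`; once the enhancement layer arrives, the same features can be reconstructed more accurately as `fine` without retransmitting the base layer.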
