Automatic speech recognition over error-prone wireless networks

Abstract The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, so that the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors. A model of network degradations and robustness techniques is presented. These techniques are classified into three categories: error detection, error recovery and error concealment (EC). A one-frame error detection scheme is described and compared with a frame-pair scheme. In contrast to vector-level techniques, a technique for error detection and EC at the sub-vector level is presented. A number of error recovery techniques, such as forward error correction and interleaving, are discussed, in addition to a review of both feature-reconstruction and ASR-decoder based EC techniques. To enable the comparison of some of these techniques, evaluation has been conducted on the basis of the same speech database and channel. Special attention is given to the unique characteristics of DSR as compared to streaming audio, e.g. voice-over-IP. Additionally, a technique for adapting ASR to the varying quality of networks is presented: the frame-error-rate is used to adjust the discrimination threshold with the goal of optimising out-of-vocabulary detection. The paper concludes with a discussion of the applicability of the different techniques based on the channel characteristics and the system requirements.
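
As an illustration of why interleaving is listed among the error recovery techniques, the following minimal sketch shows a generic block interleaver spreading a burst of consecutive losses into isolated single-frame losses, which are typically easier to conceal. It is not the scheme evaluated in the paper: the function names, the interleaving depth of 4, the 16-frame sequence and the burst position are all assumptions chosen for the example, and integers stand in for feature vectors.

```python
def interleave(frames, depth):
    """Block interleaver: fill rows of length `depth`, then read column by column."""
    if len(frames) % depth != 0:
        raise ValueError("number of frames must be a multiple of depth")
    rows = len(frames) // depth
    return [frames[r * depth + c] for c in range(depth) for r in range(rows)]


def deinterleave(frames, depth):
    """Inverse of interleave(): put each received frame back in its original slot."""
    rows = len(frames) // depth
    restored = [None] * len(frames)
    i = 0
    for c in range(depth):
        for r in range(rows):
            restored[r * depth + c] = frames[i]
            i += 1
    return restored


def longest_loss_run(frames):
    """Length of the longest run of consecutive lost frames (marked as None)."""
    longest = run = 0
    for frame in frames:
        run = run + 1 if frame is None else 0
        longest = max(longest, run)
    return longest


# Hypothetical data: 16 frame indices standing in for feature vectors.
frames = list(range(16))
sent = interleave(frames, depth=4)

# Assumed channel behaviour: one burst erases transmitted positions 4..7.
received = [None if 4 <= i < 8 else f for i, f in enumerate(sent)]
restored = deinterleave(received, depth=4)

print("burst length on the channel:", longest_loss_run(received))
print("longest loss run after de-interleaving:", longest_loss_run(restored))
```

The price of this spreading is delay: the receiver has to wait for a full block before it can restore the original frame order, so the depth trades burst robustness against latency.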
