Speech recognition in mobile environments

The growth of cellular telephony, combined with recent advances in speech recognition technology, creates sizeable opportunities for mobile speech recognition applications. Classic robustness techniques previously proposed for speech recognition yield only limited reductions in the degradation introduced by the idiosyncrasies of mobile networks. These sources of degradation include distortion introduced by the speech codec as well as artifacts arising from channel errors and discontinuous transmission. In this thesis we focus on characterizing the distortion that the speech codec introduces into the speech signal, and we propose methods for reducing the detrimental effect of coding on recognition accuracy. The initial focus is on the full-rate GSM codec (FR-GSM). We propose a method to generate recognition features directly from codec parameters, and we show that selectively constructing a cepstral feature vector from the GSM codec parameters reduces the effect of coding on recognition. The later parts of this work concern weighted acoustic modeling for robust speech recognition. This approach is motivated by the observation that coding does not distort all phones in a GSM-coded corpus to the same extent. We first establish a set of phonetic distortion classes by analyzing the distribution of the log spectral distortion that the codec introduces to each phone. These classes are then used to estimate an optimal weighted combination of acoustic models according to the average distortion encountered by each class. This method achieved a relative reduction of almost 70% in the degradation introduced by the GSM codec. Weighted acoustic modeling based on instantaneous distortion is introduced as an alternative to the method based on average distortion information.
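As an illustration of the distortion measure mentioned above, the following is a minimal sketch of one common definition of log spectral distortion: the RMS difference, in dB, between the log power spectra of an uncoded frame and its coded counterpart. The abstract does not give the exact definition used in the thesis, so the windowing, the FFT size, and the function names `log_spectral_distortion` and `mean_distortion_per_phone` are assumptions made for this example.

```python
import numpy as np


def log_spectral_distortion(clean, coded, n_fft=256, eps=1e-10):
    """RMS difference in dB between the log power spectra of two frames.

    `clean` and `coded` are time-aligned frames of equal length.
    The Hamming window and n_fft=256 are illustrative choices only.
    """
    clean = np.asarray(clean, dtype=float)
    coded = np.asarray(coded, dtype=float)
    window = np.hamming(len(clean))
    p_clean = np.abs(np.fft.rfft(clean * window, n_fft)) ** 2
    p_coded = np.abs(np.fft.rfft(coded * window, n_fft)) ** 2
    # eps guards against log(0) in empty spectral bins.
    diff_db = 10.0 * np.log10((p_clean + eps) / (p_coded + eps))
    return float(np.sqrt(np.mean(diff_db ** 2)))


def mean_distortion_per_phone(frame_pairs_by_phone):
    """Average the frame-level distortion over all frames of each phone.

    `frame_pairs_by_phone` maps a phone label to a list of
    (clean_frame, coded_frame) pairs; this layout is hypothetical.
    """
    return {
        phone: float(np.mean([log_spectral_distortion(c, d) for c, d in pairs]))
        for phone, pairs in frame_pairs_by_phone.items()
    }
```

Phones could then be grouped into distortion classes by thresholding or clustering these per-phone averages; how the thesis actually forms its classes is not specified in the abstract.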
When the extent of cepstral distortion introduced by coding is known, weighted acoustic modeling provides a reduction of about 50% in the word error rate introduced by concurrent GSM and CELP coding. We propose two methods to estimate the instantaneous distortion information: one based on recoding sensitivity and another based on long-term predictability. Because of the nonlinear relation between the time domain and the log-spectral domain, the proposed estimates of the instantaneous distortion do not perform as well as algorithms based on knowledge of the cepstral distortion. However, we show that the proposed instantaneous distortion estimates can help reach the best recognition results established in the baseline conditions while using only 50% of the baseline Gaussian density computations.
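The idea of weighting acoustic models by a distortion estimate can be sketched as a per-frame mixture of two model scores. This is an illustrative sketch only, not the formulation used in the thesis: it assumes one diagonal-covariance Gaussian trained on uncoded speech and one trained on coded speech, and a hypothetical linear mapping from distortion (in dB) to a weight. The names `distortion_to_weight` and `weighted_log_likelihood` and the thresholds `low_db` and `high_db` are invented for this example.

```python
import numpy as np


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))


def distortion_to_weight(distortion_db, low_db=1.0, high_db=4.0):
    """Map a distortion estimate to a weight in [0, 1] for the coded-speech model.

    The linear ramp and its thresholds are assumptions for illustration.
    """
    return float(np.clip((distortion_db - low_db) / (high_db - low_db), 0.0, 1.0))


def weighted_log_likelihood(x, clean_model, coded_model, weight):
    """log((1 - w) * p_clean(x) + w * p_coded(x)) for one feature vector.

    Each model is a (mean, var) pair. When the weight saturates at 0 or 1,
    only one density is evaluated.
    """
    if weight <= 0.0:
        return diag_gaussian_logpdf(x, *clean_model)
    if weight >= 1.0:
        return diag_gaussian_logpdf(x, *coded_model)
    ll_clean = diag_gaussian_logpdf(x, *clean_model)
    ll_coded = diag_gaussian_logpdf(x, *coded_model)
    # logaddexp keeps the mixture numerically stable in the log domain.
    return float(np.logaddexp(np.log1p(-weight) + ll_clean, np.log(weight) + ll_coded))
```

In this sketch a frame with low estimated distortion is scored mostly by the uncoded-speech model and a heavily distorted frame mostly by the coded-speech model. Skipping one density when the weight saturates is one way a distortion estimate could reduce Gaussian computations, though the abstract does not state whether the thesis obtains its savings this way.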
