Thinking about the Present and Future of Complex Speech Recognition

A critical component of most cognitive infocommunication systems is the current state of speech recognition technology. This paper gives a short introduction to the principles underlying today's speech recognition systems. It highlights the fact that the systems on the market are only speech-to-text transformers that produce a bare word chain at the output, discarding speech prosody, emotion, speaking style, and other information carried by the signal. Many uncertainties remain in these operational systems. Some current research directions, chiefly parallel processing, are introduced, aimed at increasing recognition accuracy and efficiency. Finally, the research agenda of META-NET for a multilingual Europe in 2020 is briefly presented.
