New Paradigm in Speech Recognition: Deep Neural Networks

This paper addresses the topic of deep neural networks (DNNs). In recent years, DNNs have become a flagship technology in artificial intelligence. Deep learning has surpassed previous state-of-the-art results in many domains: image recognition, speech recognition, language modelling, parsing, information retrieval, speech synthesis, translation, autonomous driving, game playing, etc. DNNs are able to discover and learn complex structure in very large data sets, and they generalize well. More specifically, this paper focuses on speech recognition with DNNs. We present an overview of different architectures and training procedures for DNN-based models. In the framework of broadcast news transcription, our DNN-based system reduces the word error rate dramatically compared with a classical system.
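
The abstract reports its gains as a reduction in word error rate (WER), the standard metric for speech recognition. As a point of reference, WER is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words: WER = (S + D + I) / N, where S, D and I count substitutions, deletions and insertions. The Python sketch that follows is a minimal illustration of that definition only. The function name `word_error_rate` is hypothetical and is not taken from the authors' system. It also assumes whitespace tokenization and skips the text normalization that real scoring tools usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level Levenshtein distance (illustrative sketch)."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (or exact match)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    # one substitution (sat -> sit) and one deletion (the): 2 errors / 6 words
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the number of reference words, WER is not bounded by 100%. For example, a one-word reference decoded as three wrong words gives a WER of 3.0.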