On Using 2D Sequence-to-sequence Models for Speech Recognition

Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. In these architectures, one-dimensional input and output sequences are related by an attention mechanism, which replaces the more explicit alignment procedures of classical HMM-based modeling. In contrast, we apply a novel two-dimensional long short-term memory (2DLSTM) architecture to directly model the relation between audio feature vector sequences and word sequences. Instead of relying on any kind of attention component, the proposed model uses a 2DLSTM layer to jointly assimilate context from both the input observations and the output transcriptions. Experimental evaluation on the Switchboard 300h automatic speech recognition task shows that the 2DLSTM model achieves word error rates competitive with an end-to-end attention-based model.
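To make the idea concrete, the sketch below shows a minimal 2DLSTM recurrence over a grid indexed by (output position i, input frame j), where the state at (i, j) depends on the left neighbor (i, j-1) and the neighbor below (i-1, j). This is only an illustrative NumPy toy under assumed dimensions and names (`TwoDLSTMCell`, `run_2dlstm_grid` are hypothetical); the actual system described in the paper uses trained encoder layers, subword outputs, and a softmax over the vocabulary, none of which are reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoDLSTMCell:
    """Minimal 2DLSTM cell: the state at grid position (i, j) depends on the
    left neighbor (i, j-1) and the neighbor below (i-1, j), following the
    multi-dimensional LSTM recurrence. Illustrative sketch, not the paper's
    exact parameterization."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        # One weight matrix over [x; h_left; h_below] producing five blocks:
        # input gate, forget gate (left), forget gate (below), output gate, candidate.
        cat_dim = input_dim + 2 * hidden_dim
        self.W = rng.normal(0.0, 0.1, size=(5 * hidden_dim, cat_dim))
        self.b = np.zeros(5 * hidden_dim)

    def step(self, x, h_left, c_left, h_below, c_below):
        z = self.W @ np.concatenate([x, h_left, h_below]) + self.b
        H = self.hidden_dim
        i = sigmoid(z[0:H])          # input gate
        f_l = sigmoid(z[H:2 * H])    # forget gate for the left neighbor
        f_b = sigmoid(z[2 * H:3 * H])  # forget gate for the neighbor below
        o = sigmoid(z[3 * H:4 * H])  # output gate
        g = np.tanh(z[4 * H:5 * H])  # cell candidate
        c = f_l * c_left + f_b * c_below + i * g
        h = o * np.tanh(c)
        return h, c

def run_2dlstm_grid(cell, src_enc, tgt_emb):
    """Sweep the cell over the (target position, source frame) grid.
    src_enc: (J, D_s) encoded audio frames; tgt_emb: (I, D_t) embeddings of
    the previously emitted words. Returns the states of the last source
    column, one per target position, which could feed an output layer."""
    J, I = src_enc.shape[0], tgt_emb.shape[0]
    H = cell.hidden_dim
    h = np.zeros((I + 1, J + 1, H))
    c = np.zeros((I + 1, J + 1, H))
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            # Each grid input combines one audio frame with one target embedding.
            x = np.concatenate([src_enc[j - 1], tgt_emb[i - 1]])
            h[i, j], c[i, j] = cell.step(x, h[i, j - 1], c[i, j - 1],
                                         h[i - 1, j], c[i - 1, j])
    return h[1:, J, :]  # shape (I, H): one summary vector per output position

# Toy usage: 6 audio frames of dim 8, 3 previous-word embeddings of dim 4.
cell = TwoDLSTMCell(input_dim=8 + 4, hidden_dim=16)
states = run_2dlstm_grid(cell, np.random.randn(6, 8), np.random.randn(3, 4))
print(states.shape)  # (3, 16)
```

The key design point illustrated here is that the alignment between input frames and output words is never made explicit: every output position reads the entire input sequence through the horizontal recurrence, while the vertical recurrence carries information across output positions, so the 2DLSTM state itself takes over the role played by attention weights in standard encoder-decoder models.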
