A Transformer with Interleaved Self-attention and Convolution for Hybrid Acoustic Models

Transformers with self-attention have achieved great success in natural language processing. Recently, there have been a few studies on transformers for end-to-end speech recognition, but their application to hybrid acoustic models remains limited. In this paper, we revisit the transformer-based hybrid acoustic model and propose a model structure with interleaved self-attention and 1D convolution, which is shown to converge faster and achieve higher recognition accuracy. We also study several aspects of the transformer model, including the impact of the positional encoding feature, dropout regularization, and training with and without time restriction. We show competitive recognition results on the public Librispeech dataset compared to the Kaldi baseline at both the cross-entropy and sequence training stages. For reproducible research, we release our source code and recipe within the PyKaldi2 toolbox.
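The interleaving idea can be sketched in a few lines of PyTorch: each block applies multi-head self-attention over the frame sequence and then a 1D convolution along the time axis, with residual connections and layer normalization. This is a minimal illustration, not the paper's exact configuration; the layer sizes, kernel width, and block count below are assumed for the example.

```python
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """One block of interleaved self-attention and 1D convolution.
    Dimensions are illustrative, not the paper's configuration."""
    def __init__(self, d_model=512, n_heads=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Conv1d operates over (batch, channels, time), so we transpose around it.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, time, d_model)
        a, _ = self.attn(x, x, x)          # self-attention over frames
        x = self.norm1(x + a)              # residual + layer norm
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv along time
        return self.norm2(x + c)

# Stacking blocks interleaves attention and convolution through the model.
model = nn.Sequential(*[InterleavedBlock() for _ in range(4)])
feats = torch.randn(2, 100, 512)  # (batch, frames, feature dim)
out = model(feats)
print(out.shape)
```

The convolution's `padding=kernel_size // 2` keeps the output the same length as the input, so attention and convolution layers can be stacked freely without changing the frame count.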
