Evolved Speech-Transformer: Applying Neural Architecture Search to End-to-End Automatic Speech Recognition

Neural architecture search (NAS) has been successfully applied to finding efficient, high-performance deep neural network architectures in a task-adaptive manner without extensive human intervention. This is achieved by using genetic, reinforcement-learning, or gradient-based algorithms as automated alternatives to manual architecture design. However, naively applying existing NAS algorithms to a new task may yield architectures that perform worse than manually designed ones. In this work, we show that NAS can produce efficient architectures that outperform manually designed attention-based architectures on speech recognition tasks; we name the resulting model the Evolved Speech-Transformer (EST). Combining a carefully designed search space with progressive dynamic hurdles, a genetic-algorithm-based search method, our algorithm finds a memory-efficient architecture that outperforms the vanilla Transformer while requiring less training time.
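The search strategy named above, progressive dynamic hurdles inside a genetic algorithm, can be sketched roughly as follows. This is a minimal illustrative toy, not the paper's implementation: all function names, the tournament-selection scheme (in the style of regularized evolution), and the hurdle values are assumptions for demonstration.

```python
import random

def evolve_with_hurdles(sample_architecture, mutate, evaluate,
                        hurdles, population_size=100, num_rounds=1000):
    """Toy sketch of evolution with progressive dynamic hurdles.

    evaluate(arch, budget) returns a fitness score after training
    `arch` for `budget` steps; `hurdles` is a list of
    (budget, threshold) pairs. A candidate earns the next, larger
    training budget only if its fitness at the current budget clears
    the threshold, so weak candidates are discarded cheaply.
    (All names here are illustrative, not taken from the paper.)
    """
    def fitness(arch):
        score = float("-inf")
        for budget, threshold in hurdles:
            score = evaluate(arch, budget)
            if score < threshold:
                break  # stop early: spend no more compute on weak models
        return score

    population = [sample_architecture() for _ in range(population_size)]
    scores = [fitness(a) for a in population]
    for _ in range(num_rounds):
        # Tournament selection: mutate the best of a random subsample,
        # then age out the oldest member of the population.
        sample = random.sample(range(len(population)), k=min(10, len(population)))
        parent = max(sample, key=lambda i: scores[i])
        child = mutate(population[parent])
        population.append(child)
        scores.append(fitness(child))
        population.pop(0)
        scores.pop(0)
    best = max(range(len(population)), key=lambda i: scores[i])
    return population[best], scores[best]
```

In practice the hurdle thresholds are not fixed in advance but derived from the fitness of candidates already evaluated, which is what makes them "dynamic"; the sketch above takes them as given for simplicity.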
