Efficient Gradient-Based Neural Architecture Search For End-to-End ASR

Neural architecture search (NAS) has been successfully applied to tasks such as image classification and language modeling to find efficient, high-performance network architectures. In the ASR field, especially end-to-end ASR, related research is still in its infancy. In this work, we focus on applying NAS to the most popular manually designed ASR model, the Conformer, and propose an efficient ASR model search method that benefits from the natural advantage of differentiable architecture search (Darts) in reducing computational overhead. We fuse the Darts mutator with Conformer blocks to form a complete search space, within which a modified architecture, called the Darts-Conformer cell, is found automatically. The entire search process on the AISHELL-1 dataset costs only 0.7 GPU days. Replacing the Conformer encoder with a stack of the searched cells, we obtain an end-to-end ASR model (named Darts-Conformer) that outperforms the Conformer baseline by 4.7% relative on the open-source AISHELL-1 dataset. In addition, we verify that the architecture searched on this small dataset transfers to a larger 2,000-hour dataset.
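The core mechanism behind this kind of gradient-based search is Darts's continuous relaxation: the discrete choice among candidate operations on each edge of a cell is replaced by a softmax-weighted mixture, so the architecture parameters can be optimized by gradient descent alongside the network weights, and the strongest operation is kept after search. The PyTorch sketch below illustrates that relaxation and the alternating bi-level update on training and validation splits; the candidate operation set and all names here are illustrative assumptions, not the actual Darts-Conformer search space.

```python
# Minimal sketch of a Darts-style mixed operation and bi-level update
# (illustrative; the candidate ops below are assumptions, not the
# paper's Conformer-based search space).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One cell edge: a softmax-weighted mixture over candidate ops."""
    def __init__(self, dim: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                  # skip connection
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),  # local context
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),  # wider context
        ])
        # Architecture parameters (alpha): one logit per candidate op.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, frames). Continuous relaxation of the op choice.
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedOp(dim=256)
w_params = [p for n, p in edge.named_parameters() if n != "alpha"]
w_opt = torch.optim.SGD(w_params, lr=0.1)        # updates network weights
a_opt = torch.optim.Adam([edge.alpha], lr=3e-4)  # updates architecture params

x_train = torch.randn(8, 256, 100)  # dummy acoustic features
x_valid = torch.randn(8, 256, 100)

# One alternating (first-order) bi-level step with a dummy loss: network
# weights on the training split, architecture parameters on the held-out split.
w_opt.zero_grad(); edge(x_train).pow(2).mean().backward(); w_opt.step()
a_opt.zero_grad(); edge(x_valid).pow(2).mean().backward(); a_opt.step()

# Discretization after search: keep the op with the largest alpha.
chosen = edge.ops[int(edge.alpha.argmax())]
```

In a Conformer-style search space, the candidate list would instead contain convolution modules, self-attention, and feed-forward variants, but the mixture-then-discretize mechanics are the same.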
