HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention with linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and the amount of training data needed. HyperConformer achieves a word error rate of 2.9% on LibriSpeech test-clean with fewer than 8M neural parameters and a peak training memory of 5.7GB, making it trainable on accessible hardware. The encoder is 38% faster than an equivalent Conformer on mid-length speech and 56% faster on long speech. (The HyperConformer recipe is publicly available at: https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer/)
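To make the token-mixing idea concrete, the following is a minimal PyTorch sketch of a multi-head HyperMixer layer, the kind of component that replaces self-attention inside each Conformer block. All class names and dimensions here are illustrative assumptions and do not reproduce the published SpeechBrain recipe; positional information is assumed to be added to the features before this layer.

```python
# Minimal sketch of a multi-head HyperMixer token-mixing layer (a drop-in
# replacement for self-attention in a Conformer block). Names and sizes are
# illustrative assumptions, not the SpeechBrain recipe's actual code.
import torch
import torch.nn as nn


class HyperMixerHead(nn.Module):
    """Token mixing for one head: the weights of the token-mixing MLP are
    generated by position-wise hypernetworks, so cost is linear in length."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Hypernetworks: map each frame (positional info assumed added
        # upstream) to one row of the generated weight matrices W1 and W2.
        self.hyper_in = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_hidden)
        )
        self.hyper_out = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_hidden)
        )
        self.activation = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        w1 = self.hyper_in(x)    # (batch, time, d_hidden)
        w2 = self.hyper_out(x)   # (batch, time, d_hidden)
        # Token mixing: project along the time axis through the generated MLP.
        mixed = torch.bmm(w1.transpose(1, 2), x)   # (batch, d_hidden, d_model)
        mixed = self.activation(mixed)
        return torch.bmm(w2, mixed)                # (batch, time, d_model)


class MultiHeadHyperMixer(nn.Module):
    """Splits the feature dimension into heads, mixes tokens per head,
    then concatenates the results."""

    def __init__(self, d_model: int, d_hidden: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.heads = nn.ModuleList(
            [HyperMixerHead(d_model // num_heads, d_hidden) for _ in range(num_heads)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.chunk(self.num_heads, dim=-1)
        return torch.cat([head(c) for head, c in zip(self.heads, chunks)], dim=-1)


if __name__ == "__main__":
    layer = MultiHeadHyperMixer(d_model=256, d_hidden=64, num_heads=4)
    feats = torch.randn(2, 1000, 256)   # (batch, frames, features)
    print(layer(feats).shape)           # torch.Size([2, 1000, 256])
```

Because the generated token-mixing weights have a fixed hidden size, the two batched matrix products scale linearly with the number of frames, in contrast to the quadratic cost of self-attention over long utterances.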
