Adaptive Activation Network for Low Resource Multilingual Speech Recognition

Low-resource automatic speech recognition (ASR) is a useful but thorny task, since deep learning ASR models usually need huge amounts of training data. Most existing models build a bottleneck (BN) layer by pre-training on a large source language and then transferring it to the low-resource target language. In this work, we introduce an adaptive activation network into the upper layers of the ASR model and apply different activation functions to different languages. We also propose two approaches to train the model: (1) cross-lingual learning, which replaces the activation function of the source language with that of the target language, and (2) multilingual learning, which jointly trains the Connectionist Temporal Classification (CTC) loss of each language and the relevance of different languages. Our experiments on the IARPA Babel datasets demonstrate that our approaches outperform from-scratch training and traditional bottleneck-feature-based methods. In addition, combining cross-lingual learning and multilingual learning can further improve the performance of multilingual speech recognition.
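
The idea of sharing layer weights while switching the activation function per language can be illustrated with a minimal sketch. The abstract does not specify how the activation functions are parameterized, so the piecewise-linear hinge form below is an assumption, and all names (KNOTS, AdaptiveActivation, SharedLayer, the "source" and "target" labels) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed fixed hinge locations; not taken from the paper.
KNOTS = [-1.0, 0.0, 1.0]


@dataclass
class AdaptiveActivation:
    """Assumed form: f(x) = sum_k coeffs[k] * max(0, x - KNOTS[k])."""
    # Default coefficients make the function an ordinary ReLU.
    coeffs: List[float] = field(default_factory=lambda: [0.0, 1.0, 0.0])

    def __call__(self, x: float) -> float:
        return sum(c * max(0.0, x - k) for c, k in zip(self.coeffs, KNOTS))


@dataclass
class SharedLayer:
    """A layer whose weights are shared and whose activation is per language."""
    weights: List[List[float]]
    activations: Dict[str, AdaptiveActivation] = field(default_factory=dict)

    def add_language(self, lang: str) -> None:
        # Each language gets its own activation parameters.
        self.activations[lang] = AdaptiveActivation()

    def forward(self, x: List[float], lang: str) -> List[float]:
        act = self.activations[lang]
        return [act(sum(w * xi for w, xi in zip(row, x))) for row in self.weights]


layer = SharedLayer(weights=[[1.0, -1.0], [0.5, 0.5]])
layer.add_language("source")
layer.add_language("target")

# Cross-lingual step in this sketch: the weights stay as they are and only
# the target language's activation coefficients differ from the source's.
layer.activations["target"].coeffs = [0.5, 0.5, 0.0]

x = [2.0, 1.0]
source_out = layer.forward(x, "source")
target_out = layer.forward(x, "target")
```

In this sketch the same input and the same weights yield different outputs for the two languages, because only the activation coefficients are language-specific. Training those coefficients, and the joint CTC objective used for multilingual learning, are outside its scope.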
