TO REVERSE THE GRADIENT OR NOT: AN EMPIRICAL COMPARISON OF ADVERSARIAL AND MULTITASK LEARNING IN SPEECH RECOGNITION

Feb 2019

Transcribed datasets typically contain the speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: multi-task learning and adversarial learning. In multi-task learning, the auxiliary goal is speaker prediction; we expect this joint training to improve performance if the two tasks of speech recognition and speaker recognition share a common set of underlying features. In contrast, adversarial learning is a means to learn representations that are invariant to the speaker; we expect better performance if this learnt invariance helps the model generalize to new speakers. While both approaches seem natural in the context of speech recognition, they are incompatible because they correspond to opposite gradients back-propagated to the model. To better understand the effect of these approaches on error rates, we compare both strategies in controlled settings. Moreover, we explore the use of additional un-transcribed data in a semi-supervised, adversarial learning manner to improve error rates. Our results show that deep models trained on big datasets already develop speaker-invariant representations without any auxiliary loss. Both adversarial learning and multi-task learning seem to have only a minor impact on the acoustic model. However, models trained in a semi-supervised manner can improve error rates.
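The incompatibility described above can be illustrated with a minimal sketch. It assumes that a shared encoder receives a gradient from the speech recognition loss and a gradient from the speaker classification loss, and that the speaker term is scaled by a weight `lam`. The function name, the weight `lam`, and the toy gradient values are hypothetical and are not taken from the paper; the sketch only shows that the two strategies differ by the sign of the speaker term.

```python
from typing import List


def encoder_gradient(
    grad_asr: List[float],
    grad_speaker: List[float],
    lam: float,
    adversarial: bool,
) -> List[float]:
    """Combine the gradients that reach the shared encoder parameters.

    Multi-task learning adds the speaker gradient (the encoder is pushed to
    help speaker prediction). Adversarial learning reverses it (the encoder
    is pushed to hinder speaker prediction). `lam` is an assumed weight.
    """
    sign = -1.0 if adversarial else 1.0
    return [ga + sign * lam * gs for ga, gs in zip(grad_asr, grad_speaker)]


# Toy per-parameter gradients (illustrative values only).
grad_asr = [0.5, -1.0, 0.25]
grad_speaker = [1.0, 2.0, -4.0]

multitask = encoder_gradient(grad_asr, grad_speaker, lam=0.5, adversarial=False)
adversarial = encoder_gradient(grad_asr, grad_speaker, lam=0.5, adversarial=True)

print(multitask)    # speaker gradient added
print(adversarial)  # speaker gradient reversed
```

With the same speaker classifier and the same weight, the two updates to the encoder differ by exactly twice the scaled speaker gradient, which is why both objectives cannot be pursued at once on the same representation.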
