Unsupervised Domain Adaptation Under Label Space Mismatch for Speech Classification

Unsupervised domain adaptation (UDA) using adversarial learning has shown promise in adapting speech models from a labeled source domain to an unlabeled target domain. However, prior work makes the strong assumption that the label spaces of the source and target domains are identical, which is easily violated in real-world conditions. We present AMLS, an end-to-end architecture that performs Adaptation under Mismatched Label Spaces using two weighting schemes to separate shared and private classes in each domain. An evaluation on three speech adaptation tasks, namely gender, microphone, and emotion adaptation, shows that AMLS provides significant accuracy gains over baselines used in speech and vision adaptation tasks. Our contribution paves the way for applying UDA to speech models in unconstrained settings with no assumptions on the source and target label spaces.
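
The abstract does not specify how AMLS's two weighting schemes are computed, so the following is only a hypothetical sketch of per-sample weighting for label-space mismatch, in the spirit of related work such as Universal Domain Adaptation (You et al., 2019). The function names (normalized_entropy, transferability_weights), the reliance on a binary domain discriminator, and the exact weight formulas are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: estimate how likely each sample belongs to the
# label space SHARED by source and target, so private-class samples can
# be down-weighted during adversarial alignment. Not the AMLS scheme.
import torch
import torch.nn.functional as F

def normalized_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Prediction entropy scaled to [0, 1]; high entropy suggests a
    sample may belong to a class unseen during source training."""
    probs = F.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    num_classes = float(logits.shape[-1])
    return ent / torch.log(torch.tensor(num_classes))

def transferability_weights(logits: torch.Tensor,
                            domain_prob: torch.Tensor,
                            is_source: bool) -> torch.Tensor:
    """Per-sample weights in [0, 1] favoring shared-class samples.

    domain_prob: discriminator's P(sample comes from the source domain).
    Heuristic: source samples that look target-like and are confidently
    classified are likely shared; symmetrically, target samples that
    look source-like and are confidently classified are likely shared.
    """
    ent = normalized_entropy(logits)
    if is_source:
        w = (1.0 - domain_prob) + (1.0 - ent)
    else:
        w = domain_prob + (1.0 - ent)
    return w / w.max()
```

Under this assumed design, the weights would multiply the per-sample classification and domain-adversarial losses, so that samples from private (domain-specific) classes contribute less to the feature alignment between domains.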
