Towards Learning a Universal Non-Semantic Representation of Speech

The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained on different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a representation based on an unsupervised triplet-loss objective. The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks. The embedding is trained on a publicly available dataset and is tested on a variety of low-resource downstream tasks, including personalization tasks and tasks from the medical domain. The benchmark, models, and evaluation code are publicly released.
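To make the objective concrete, a triplet loss has the following generic form; the NumPy sketch below is an illustrative assumption on our part (including the margin value and the toy data), not the paper's training code. In the unsupervised setting described above, positives can be drawn from segments of the same audio clip that occur close together in time, and negatives from unrelated clips.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Generic triplet loss over batches of embedding vectors.

    Pulls each anchor toward its positive and pushes it away from its
    negative until the gap between the two squared distances exceeds
    `margin`. The margin of 0.1 is illustrative, not the paper's setting.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # squared L2 distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

# Toy usage with random 512-dimensional "embeddings".
rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(8, 512)) for _ in range(3))
print(triplet_loss(a, p, n))
```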
