Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language

Acoustic word embedding models map variable-duration speech segments to fixed-dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and then applied to a target zero-resource language (without fine-tuning). However, it is still unclear how the specific choice of training languages affects downstream performance. Concretely, here we ask whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally does not hurt performance.
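The property the abstract relies on (variable-duration segments mapped to fixed-dimensional vectors) is what makes query-by-example search cheap: once segments are embedded, search reduces to vector comparisons. A minimal sketch of this idea, where `embed` is a hypothetical stand-in (mean-pooling plus a random projection) rather than the trained multilingual model described in the paper:

```python
import numpy as np

def embed(segment, dim=128, seed=0):
    # Stand-in for a trained acoustic word embedding model: maps a
    # variable-duration segment (num_frames x num_features) to a
    # fixed-dimensional unit vector. Mean-pool over time, then apply
    # a fixed random projection -- purely illustrative.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((segment.shape[1], dim))
    vec = segment.mean(axis=0) @ proj
    return vec / np.linalg.norm(vec)

def rank_by_cosine(query_vec, search_vecs):
    # With unit-normalised embeddings, cosine similarity is a dot
    # product, so ranking an entire collection is one matrix multiply.
    scores = search_vecs @ query_vec
    return np.argsort(-scores)

# Three "utterance segments" of different durations, 13 features per frame.
rng = np.random.default_rng(1)
segments = [rng.standard_normal((n, 13)) for n in (40, 55, 70)]
collection = np.stack([embed(s) for s in segments])

# A query identical to the second segment should rank it first.
query = embed(segments[1])
ranking = rank_by_cosine(query, collection)
```

In the query-by-example evaluation described above, the query and search segments come from different speakers, so the quality of the ranking depends entirely on how speaker-invariant the learned embedding space is.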
