Cross-Lingual Transfer for Speech Processing Using Acoustic Language Similarity

Speech processing systems currently do not support the vast majority of languages, in part due to the lack of data in low-resource languages. Cross-lingual transfer offers a compelling way to help bridge this digital divide by incorporating high-resource data into low-resource systems. Current cross-lingual algorithms have shown success in text-based tasks and in speech-related tasks over some low-resource languages. However, scaling up speech systems to support hundreds of low-resource languages remains unsolved. To help bridge this gap, we propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages. We demonstrate the effectiveness of our approach on language family classification, speech recognition, and speech synthesis tasks.
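The core idea of such a similarity-based approach can be sketched as follows: represent each language by a single acoustic embedding (for example, the mean of per-utterance embeddings), then rank high-resource candidates by their similarity to the low-resource target. This is a minimal illustrative sketch, not the paper's exact method; the function names and the use of cosine similarity over mean-pooled embeddings are assumptions for illustration.

```python
import numpy as np

def language_embedding(utterance_embeddings: np.ndarray) -> np.ndarray:
    """Pool per-utterance acoustic embeddings (rows) into one
    language-level vector by mean pooling."""
    return np.mean(utterance_embeddings, axis=0)

def rank_transfer_languages(target_emb: np.ndarray,
                            candidate_embs: dict) -> list:
    """Rank candidate high-resource languages by cosine similarity
    to the low-resource target's acoustic embedding.
    Returns (language, score) pairs, best candidate first."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {lang: cosine(target_emb, emb)
              for lang, emb in candidate_embs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because each language is reduced to one vector, comparing a target against hundreds of candidates is a handful of dot products, which is what makes this kind of ranking efficient at scale.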
