Unsupervised Phonetic and Word-Level Discovery for Speech-to-Speech Translation for Unwritten Languages

We experiment with unsupervised methods for deriving and clustering symbolic representations of speech, working towards speech-to-speech translation for languages without regular (or any) written representations. We consider five low-resource African languages, and for each language we produce three segmental representations of text data for comparison against four segmental representations derived solely from acoustic data. The text and speech data for each language come from the CMU Wilderness dataset introduced in [24], in which speakers read a version of the New Testament in their language. Our goal is to evaluate the translation performance not only of acoustically derived units but also of discovered sequences or "words" made from these units, with the intuition that such representations will encode more meaning than phones alone. We train statistical machine translation models for each representation and evaluate their outputs using BLEU-1 scores to determine their efficacy. Our experiments produce encouraging results: as we cluster our atomic phonetic representations into more word-like units, the amount of information retained generally approaches that of the actual words themselves.
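The word-like units described above can be built by greedily merging frequent adjacent unit pairs, in the spirit of byte-pair encoding [11, 21] applied to phone sequences. Below is a minimal sketch of that merging loop; the toy phone symbols and utterances are illustrative assumptions, not actual discovered units from the experiments.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent unit pairs across all sequences; return the most common."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with a single merged unit."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_units(sequences, num_merges):
    """Greedily apply the most frequent pair merge `num_merges` times."""
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return sequences

# Toy "phone" sequences for two utterances of an imaginary language.
utts = [["h", "a", "l", "o", "h", "a", "l", "o"],
        ["h", "a", "l", "o", "m", "i"]]
print(learn_units(utts, 3))  # → [['halo', 'halo'], ['halo', 'm', 'i']]
```

After three merges the recurring phone sequence `h a l o` has collapsed into a single word-like unit, which is the kind of longer symbol the translation models above consume in place of individual phones.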

[1] Herman Kamper et al., "Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints in Encoder-decoder Models," ICASSP, 2019.

[2] Graham Neubig et al., "Learning Language Representations for Typology Prediction," EMNLP, 2017.

[3] Alan W. Black et al., "Deriving Phonetic Transcriptions and Discovering Word Segmentations for Speech-to-Speech Translation in Low-Resource Settings," INTERSPEECH, 2016.

[4] Alan W. Black et al., "Text to Speech in New Languages Without a Standardized Orthography," SSW, 2013.

[5] Lukáš Burget et al., "Bayesian Phonotactic Language Model for Acoustic Unit Discovery," ICASSP, 2017.

[6] Olivier Pietquin et al., "Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation," NIPS, 2016.

[7] Lukáš Burget et al., "Variational Inference for Acoustic Unit Discovery," Workshop on Spoken Language Technologies for Under-resourced Languages, 2016.

[8] Bowen Zhou et al., "Towards Speech Translation of Non Written Languages," IEEE Spoken Language Technology Workshop, 2006.

[9] Philipp Koehn et al., "Moses: Open Source Toolkit for Statistical Machine Translation," ACL, 2007.

[10] Tomoki Toda et al., "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," IEEE Transactions on Audio, Speech, and Language Processing, 2007.

[11] Philip Gage et al., "A New Algorithm for Data Compression," 1994.

[12] Olivier Pietquin et al., "End-to-End Automatic Speech Translation of Audiobooks," ICASSP, 2018.

[13] Dong Yu et al., "Improved Bottleneck Features Using Pretrained Deep Neural Networks," INTERSPEECH, 2011.

[14] Salim Roukos et al., "BLEU: a Method for Automatic Evaluation of Machine Translation," ACL, 2002.

[15] T. Griffiths et al., "A Bayesian Framework for Word Segmentation: Exploring the Effects of Context," Cognition, 2009.

[16] James R. Glass et al., "A Nonparametric Bayesian Approach to Acoustic Model Discovery," ACL, 2012.

[17] Alan W. Black et al., "Bootstrapping Text-to-Speech for Speech Processing in Languages Without an Orthography," ICASSP, 2013.

[18] Roger K. Moore et al., "Discovering the Phoneme Inventory of an Unwritten Language: A Machine-assisted Approach," Speech Communication, 2014.

[19] Martin Karafiát et al., "The Language-independent Bottleneck Features," IEEE Spoken Language Technology Workshop (SLT), 2012.

[20] Karen Livescu et al., "An Embedded Segmental K-means Model for Unsupervised Segmentation and Clustering of Speech," ASRU, 2017.

[21] Rico Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units," ACL, 2015.

[22] Sharon Goldwater et al., "Multilingual Bottleneck Features for Subword Modeling in Zero-resource Languages," INTERSPEECH, 2018.

[23] Florian Metze et al., "Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the 'Speaking Rosetta' JSALT 2017 Workshop," ICASSP, 2018.

[24] Alan W. Black et al., "CMU Wilderness Multilingual Speech Dataset," ICASSP, 2019.

[25] Alan W. Black et al., "CLUSTERGEN: A Statistical Parametric Synthesizer Using Trajectory Modeling," INTERSPEECH, 2006.

[26] Alan W. Black et al., "Automatic Discovery of a Phonetic Inventory for Unwritten Languages for Statistical Speech Synthesis," ICASSP, 2014.

[27] Bin Ma et al., "Extracting Bottleneck Features and Word-like Pairs from Untranscribed Speech for Feature Representation," ASRU, 2017.

[28] Adam Lopez et al., "Low-Resource Speech-to-Text Translation," INTERSPEECH, 2018.

[29] Raju Uma et al., "A New Algorithm for Data Compression," 2013.

[30] Adam Lopez et al., "Pre-training on High-resource Speech Recognition Improves Low-resource Speech-to-text Translation," NAACL, 2018.

[31] Alan W. Black et al., "Using Articulatory Features and Inferred Phonological Segments in Zero Resource Speech Processing," INTERSPEECH, 2015.

[32] Navdeep Jaitly et al., "Sequence-to-Sequence Models Can Directly Translate Foreign Speech," INTERSPEECH, 2017.

[33] A. Black et al., "Building an ASR System for a Low-resource Language Through the Adaptation of a High-resource Language ASR System: Preliminary Results," 2017.

[34] Jörg Franke et al., "Towards Phoneme Inventory Discovery for Documentation of Unwritten Languages," ICASSP, 2017.

[35] Su-Youn Yoon et al., "A Python Toolkit for Universal Transliteration," LREC, 2010.

[36] Lukáš Burget et al., "An Empirical Evaluation of Zero Resource Acoustic Unit Discovery," ICASSP, 2017.