Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

When documenting oral languages, Unsupervised Word Segmentation (UWS) from speech is a useful yet challenging task. It can be performed from phonetic transcriptions or, in their absence, from the output of unsupervised speech discretization models. These discretization models are trained on raw speech only, producing discrete speech units that can be fed to downstream (text-based) tasks. In this paper we compare five such models, three Bayesian and two neural, with regard to the exploitability of the produced units for UWS. We experiment with two UWS models and report results for Finnish, Hungarian, Mboshi, Romanian and Russian in a low-resource setting (using only 5k sentences). Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results with the SHMM and H-SHMM Bayesian models, which produce high-quality yet compressed discrete representations of the input speech signal.
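To illustrate the sequence-length issue mentioned above: frame-level discretization typically emits one unit per short analysis frame, so the same unit repeats many times in a row. A common, simple way to shorten such sequences before passing them to a text-based UWS model is to collapse consecutive repeats. The sketch below is ours, not from the paper; the function name and example unit IDs are hypothetical.

```python
def collapse_runs(units):
    """Collapse consecutive repeated discrete units into a single symbol.

    Frame-level discretization (e.g. one unit ID per 10 ms frame) yields
    long sequences with many repeats; run-length collapsing is one simple
    way to compress them before text-based word segmentation.
    """
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

# Hypothetical frame-level unit IDs for a short utterance.
frames = [7, 7, 7, 12, 12, 3, 3, 3, 3, 12]
print(collapse_runs(frames))  # [7, 12, 3, 12]
```

This only removes adjacent duplicates; it keeps a unit that reappears later (the final 12 above survives), so the unit order of the utterance is preserved.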
