Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the first, features are greedily merged until a prespecified number of segments is reached. The second uses dynamic programming to optimize a squared error with a penalty term that encourages fewer but longer segments. We show that these VQ segmentation methods can be used without alteration across a wide range of tasks: unsupervised phone segmentation, ABX phone discrimination, same-different word discrimination, and as inputs to a symbolic word segmentation algorithm. The penalized method generally performs best. While results are comparable to the state of the art in only some cases, in all tasks a reasonable competing approach is outperformed at a substantially lower bitrate.
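The two segmentation methods described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the function names (`greedy_segment`, `penalized_segment`), the use of a constant per-segment penalty (the abstract only says that a penalty term encourages fewer but longer segments), and the greedy rule of merging the adjacent pair of segments whose merge increases the total squared error the least. Here `z` stands for the encoder features (one row per frame) and `codebook` for the VQ code vectors.

```python
import numpy as np


def _cumulative_dists(z, codebook):
    """Prefix sums of squared distances from each frame to each code: (T+1, K)."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return np.vstack([np.zeros((1, d.shape[1])), np.cumsum(d, axis=0)])


def _segment_cost(cum, start, end):
    """Lowest squared error, and the code achieving it, when frames
    start..end-1 are all assigned to one code."""
    errs = cum[end] - cum[start]
    k = int(np.argmin(errs))
    return float(errs[k]), k


def _to_segments(cum, bounds):
    """Turn a boundary list into (start, end, code) triples."""
    return [(a, b, _segment_cost(cum, a, b)[1])
            for a, b in zip(bounds[:-1], bounds[1:])]


def greedy_segment(z, codebook, n_segments):
    """Assumed greedy rule: start with one segment per frame and repeatedly
    remove the boundary whose removal adds the least squared error."""
    cum = _cumulative_dists(z, codebook)
    bounds = list(range(len(z) + 1))
    while len(bounds) - 1 > n_segments:
        best_i, best_inc = None, np.inf
        for i in range(1, len(bounds) - 1):
            a, b, c = bounds[i - 1], bounds[i], bounds[i + 1]
            inc = (_segment_cost(cum, a, c)[0]
                   - _segment_cost(cum, a, b)[0]
                   - _segment_cost(cum, b, c)[0])
            if inc < best_inc:
                best_i, best_inc = i, inc
        del bounds[best_i]
    return _to_segments(cum, bounds)


def penalized_segment(z, codebook, penalty, max_len=None):
    """Dynamic programming over segment end points, minimizing squared error
    plus an assumed constant `penalty` for every segment used."""
    cum = _cumulative_dists(z, codebook)
    T = len(z)
    max_len = T if max_len is None else max_len
    best = np.full(T + 1, np.inf)  # best[t]: lowest cost of frames 0..t-1
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)  # back[t]: start of the last segment
    for end in range(1, T + 1):
        for start in range(max(0, end - max_len), end):
            cost = best[start] + _segment_cost(cum, start, end)[0] + penalty
            if cost < best[end]:
                best[end], back[end] = cost, start
    bounds = [T]
    while bounds[-1] > 0:  # follow back-pointers to recover the boundaries
        bounds.append(int(back[bounds[-1]]))
    bounds.reverse()
    return _to_segments(cum, bounds)


# Toy input: two frames near code 0 followed by three frames near code 1.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
z = np.array([[0.0, 0.1], [0.1, 0.0], [10.0, 10.1], [9.9, 10.0], [10.0, 10.0]])
print(greedy_segment(z, codebook, n_segments=2))
print(penalized_segment(z, codebook, penalty=1.0))
```

In this sketch the greedy method needs the number of segments in advance, whereas the penalized method lets the penalty weight decide how many segments are used: a larger `penalty` yields fewer, longer segments. The prefix sums make the cost of any candidate segment a single subtraction per code.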
