Paper information - Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the first, features are greedily merged until a prespecified number of segments is reached. The second uses dynamic programming to optimize a squared error with a penalty term that encourages fewer but longer segments. We show that these VQ segmentation methods can be used without alteration across a wide range of tasks: unsupervised phone segmentation, ABX phone discrimination, same-different word discrimination, and as inputs to a symbolic word segmentation algorithm. The penalized method generally performs best. While results are comparable to the state of the art in only some cases, in all tasks a reasonable competing approach is outperformed at a substantially lower bitrate.
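The penalized dynamic-programming idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a constant penalty added once per segment (the abstract does not give the exact form of the penalty term), and the names `penalized_segment`, `features`, `codebook`, `penalty`, and `max_len` are hypothetical. Each segment is assigned the single code that minimizes the summed squared error over its frames.

```python
import numpy as np

def penalized_segment(features, codebook, penalty, max_len=None):
    """Segment `features` (T x D) so that each block of contiguous frames
    shares one code from `codebook` (K x D).

    Minimizes: summed squared error + penalty * (number of segments).
    The constant per-segment penalty is an assumption made for this sketch.
    Returns (segment end indices, code index per segment).
    """
    T = len(features)
    max_len = max_len or T
    # dists[t, k]: squared distance between frame t and code k.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # cum[t, k]: sum of dists[:t, k]; segment [s, t) costs cum[t] - cum[s].
    cum = np.vstack([np.zeros((1, codebook.shape[0])), np.cumsum(dists, axis=0)])

    best = np.full(T + 1, np.inf)   # best[t]: lowest cost of segmenting frames [0, t)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)      # start index of the last segment
    code_at = np.zeros(T + 1, dtype=int)   # code assigned to the last segment
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            seg_costs = cum[t] - cum[s]
            k = int(np.argmin(seg_costs))
            cost = best[s] + seg_costs[k] + penalty
            if cost < best[t]:
                best[t], back[t], code_at[t] = cost, s, k

    # Backtrack from the final frame to recover boundaries and codes.
    bounds, codes = [], []
    t = T
    while t > 0:
        bounds.append(t)
        codes.append(int(code_at[t]))
        t = int(back[t])
    return bounds[::-1], codes[::-1]

# Toy example: six 1-D frames near two codes (0 and 5).
features = np.array([[0.0], [0.1], [0.0], [5.0], [5.1], [5.0]])
codebook = np.array([[0.0], [5.0]])
bounds, codes = penalized_segment(features, codebook, penalty=1.0)
print(bounds, codes)  # [3, 6] [0, 1]
```

In this sketch, a larger `penalty` makes each extra segment more expensive, so the result has fewer, longer segments; with `penalty=0` nothing discourages cutting at every frame.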

Benjamin van Niekerk | Herman Kamper
