Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation

We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform and is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak-detection algorithm is applied over the model outputs to produce the final boundaries. As such, the proposed model is trained in a fully unsupervised manner, with no manual annotations in the form of target boundaries or phonetic transcriptions. We compare the proposed approach to several unsupervised baselines on both the TIMIT and Buckeye corpora. Results suggest that our approach surpasses the baseline models and reaches state-of-the-art performance on both datasets. Furthermore, we experimented with expanding the training set with additional examples from the LibriSpeech corpus. We evaluated the resulting model on distributions and languages that were not seen during training (English, Hebrew, and German) and showed that utilizing additional untranscribed data is beneficial for model performance.
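The test-time pipeline described above (score adjacent frame representations for spectral change, then run peak detection over the score sequence) can be sketched as follows. This is a minimal illustration, not the paper's actual model: the frame sizes are hypothetical, and the learned convolutional encoder is replaced by a crude hand-crafted two-band energy feature so the example is self-contained.

```python
import math

def frames(wav, frame_len=160, hop=80):
    # Slice the raw waveform into overlapping frames (hypothetical sizes).
    return [wav[i:i + frame_len]
            for i in range(0, len(wav) - frame_len + 1, hop)]

def spectral_feature(frame):
    # Stand-in for the learned conv encoder: total energy plus a
    # first-difference energy, which acts as a high-frequency proxy.
    energy = sum(x * x for x in frame)
    hf = sum((frame[i + 1] - frame[i]) ** 2 for i in range(len(frame) - 1))
    return (energy, hf)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1e-8
    nv = math.sqrt(sum(b * b for b in v)) or 1e-8
    return dot / (nu * nv)

def boundary_scores(feats):
    # High dissimilarity between adjacent frames marks a spectral
    # change, i.e. a candidate phoneme boundary.
    return [1.0 - cosine(feats[i], feats[i + 1])
            for i in range(len(feats) - 1)]

def pick_peaks(scores, thresh):
    # Peak detection over the score sequence: local maxima above a threshold.
    return [i for i in range(1, len(scores) - 1)
            if scores[i] > scores[i - 1]
            and scores[i] >= scores[i + 1]
            and scores[i] > thresh]

# Toy "phone transition": a 200 Hz tone followed by a 3 kHz tone.
sr = 16000
wav = ([math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
       + [math.sin(2 * math.pi * 3000 * n / sr) for n in range(800)])
feats = [spectral_feature(f) for f in frames(wav)]
scores = boundary_scores(feats)
mean = sum(scores) / len(scores)
peaks = pick_peaks(scores, thresh=mean)  # peaks near the tone transition
```

In the paper's actual setting the features come from a trained encoder, and the contrastive (NCE-style) objective is what makes adjacent frames within a phone similar and frames across a boundary dissimilar; the hand-crafted feature here only mimics that property on a synthetic signal.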
