Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages

Discovering symbolic units from unannotated speech data is fundamental to zero-resource speech technology. Previous studies focused on learning fixed-length, frame-level units from acoustic features. Although such units achieve high quality, they suffer from a high bit rate because every time frame is encoded. In this work, to discover variable-length, low-bit-rate speech representations from a limited amount of unannotated speech data, we propose an approach based on graph neural networks (GNNs) that exploits the temporal closeness of salient speech features. Our approach builds on vector-quantized neural networks (VQNNs), which learn a discrete encoding through contrastive predictive coding (CPC). VQNNs encode input data with a predetermined finite set of embeddings (a codebook). We treat the codebook as the node set of a directed graph in which each arc represents a transition from one code to another. We then extract and encode the topological features of the graph's nodes and cluster them using graph convolution, yielding a coarsened speech representation. We evaluated our model on the English data set of the 2019 track of the ZeroSpeech 2020 challenge. Our model substantially reduces the bit rate while maintaining high unit quality.
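To make the pipeline concrete, here is a minimal sketch in Python using PyTorch Geometric. Everything beyond the abstract's description is an assumption: the codebook size, the use of outgoing-transition distributions as node features, the GCNConv layers, and the cluster count are illustrative stand-ins, not the authors' exact architecture.

```python
# Illustrative sketch (assumptions, not the authors' exact model): the VQ
# codebook is treated as the node set of a directed graph whose arcs are
# transitions between consecutive codes; simple transition statistics serve
# as node features, and a two-layer graph convolution produces soft cluster
# assignments that coarsen the code sequence.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv  # PyTorch Geometric


def transition_graph(codes: torch.Tensor, num_codes: int):
    """Build a directed transition graph from a 1-D sequence of code indices."""
    src, dst = codes[:-1], codes[1:]
    flat = src * num_codes + dst
    counts = torch.bincount(flat, minlength=num_codes * num_codes).float()
    counts = counts.view(num_codes, num_codes)
    edge_index = counts.nonzero().t().contiguous()      # [2, E] arc list
    edge_weight = counts[edge_index[0], edge_index[1]]  # transition counts
    # Node features: each code's outgoing-transition distribution (assumed).
    x = counts / counts.sum(dim=1, keepdim=True).clamp(min=1.0)
    return x, edge_index, edge_weight


class CodebookClusterer(torch.nn.Module):
    """Hypothetical clustering head: graph convolutions + soft assignment."""

    def __init__(self, in_dim: int, hidden: int, num_clusters: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, num_clusters)

    def forward(self, x, edge_index, edge_weight):
        h = F.relu(self.conv1(x, edge_index, edge_weight))
        return F.softmax(self.conv2(h, edge_index, edge_weight), dim=-1)


num_codes = 512                                 # assumed codebook size
codes = torch.randint(0, num_codes, (10_000,))  # stand-in for VQNN output
x, edge_index, edge_weight = transition_graph(codes, num_codes)

model = CodebookClusterer(num_codes, 64, num_clusters=50)
assign = model(x, edge_index, edge_weight)      # soft cluster per codebook entry
coarse = assign.argmax(dim=-1)[codes]           # coarsened code sequence
```

In a setup like this, consecutive identical coarse codes can then be merged into variable-length segments, which is what lowers the bit rate relative to per-frame encoding.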
