Phoneme Based Embedded Segmental K-Means for Unsupervised Term Discovery

Identifying and grouping frequently occurring word-like patterns from raw acoustic waveforms is an important task in zero-resource speech processing. Embedded segmental K-means (ES-KMeans) discovers both the word boundaries and the word types from raw data. Starting from an initial set of subword boundaries, ES-KMeans iteratively eliminates some of the boundaries to arrive at frequently occurring longer word patterns. Note that the initial subword boundaries are not adjusted during this process. As a result, the performance of ES-KMeans depends critically on the initial subword boundaries. Originally, syllable boundaries were used to initialize ES-KMeans. In this paper, we propose to initialize ES-KMeans with a phoneme segmentation method that produces boundaries closer to the true boundaries. The use of shorter units increases the number of initial boundaries, which leads to a significant increase in computational complexity. To reduce the computational cost, we extract compact lower-dimensional embeddings from an auto-encoder. The proposed algorithm is benchmarked on the Zero Resource Speech Challenge 2017, which consists of 70 hours of unlabeled data across three languages, viz. English, French, and Mandarin. The proposed algorithm outperforms the baseline system without any language-specific parameter tuning.
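The boundary-elimination idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes segments are embedded by uniform downsampling of frames (the paper uses auto-encoder embeddings instead), it updates cluster means as plain unweighted averages, and the names `embed`, `segment`, `es_kmeans`, and `max_units` are hypothetical. It only shows how a fixed set of candidate boundaries can be thinned by alternating a dynamic-programming segmentation step with a K-means update.

```python
import numpy as np

def embed(frames, n=4):
    # Assumption: fixed-size embedding by uniform downsampling of the segment's frames.
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    return frames[idx].ravel()

def segment(frames, bounds, means, max_units=3):
    # Dynamic programming over candidate boundaries: keep the subset that
    # minimises the duration-weighted squared distance to the nearest mean.
    # Boundaries can only be dropped, never moved.
    n = len(bounds)
    best = np.full(n, np.inf)
    best[0] = 0.0
    back = np.zeros(n, dtype=int)
    for j in range(1, n):
        for i in range(max(0, j - max_units), j):
            seg = frames[bounds[i]:bounds[j]]
            dist = np.min(np.sum((means - embed(seg)) ** 2, axis=1))
            cost = best[i] + len(seg) * dist
            if cost < best[j]:
                best[j], back[j] = cost, i
    kept = [n - 1]
    while kept[-1] != 0:  # backtrace from the last boundary to the first
        kept.append(back[kept[-1]])
    return [bounds[t] for t in reversed(kept)]

def es_kmeans(frames, bounds, k=2, n_iter=5, seed=0):
    # bounds: strictly increasing frame indices, starting at 0 and ending at len(frames).
    rng = np.random.default_rng(seed)
    init = np.array([embed(frames[a:b]) for a, b in zip(bounds[:-1], bounds[1:])])
    means = init[rng.choice(len(init), size=k, replace=False)]
    for _ in range(n_iter):
        kept = segment(frames, bounds, means)
        X = np.array([embed(frames[a:b]) for a, b in zip(kept[:-1], kept[1:])])
        labels = np.argmin(((X[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave empty clusters unchanged
                means[c] = X[labels == c].mean(axis=0)
    return kept, labels

# Toy usage on random "features" with hand-picked candidate boundaries.
features = np.random.default_rng(0).normal(size=(60, 5))
candidates = [0, 7, 15, 22, 30, 41, 50, 60]
kept_bounds, cluster_ids = es_kmeans(features, candidates, k=2, n_iter=3)
print(kept_bounds, cluster_ids)
```

In this sketch the inner loop visits up to `max_units` predecessors for every candidate boundary, so a denser initial boundary set (phonemes rather than syllables) directly increases the number of segment embeddings that must be computed and compared, which is the cost that compact embeddings are meant to offset.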
