Bottom-Up Unsupervised Word Discovery via Acoustic Units

Unsupervised term discovery is the task of identifying and grouping recurring word-like patterns in untranscribed audio. It facilitates unsupervised acoustic model training in zero-resource settings, where little or no transcribed speech is available. In this paper, we investigate two-step bottom-up approaches for the unsupervised discovery of word-like units. The first step discovers phone-like acoustic units from the data, and the second step combines these basic acoustic blocks to identify word-like units. We investigate Embedded Segmental K-means (ESK-Means) and the Nested Hierarchical Pitman-Yor (PYR) model as bottom-up strategies. ESK-Means iteratively selects boundaries from an initial set of candidates to arrive at the word boundaries, so its final performance depends critically on the quality of the initial boundaries. We therefore use a segmentation method whose initial boundaries lie much closer to the actual boundaries. The PYR model has previously been used to segment text from which the spaces have been removed; here we use it for word discovery from acoustic units obtained without supervision. Term discovery performance is evaluated on the Zero Resource 2017 challenge dataset, which consists of around 70 hours of unlabelled data. Our systems outperform the baseline systems on all languages without language-specific parameter tuning. We also report comprehensive experiments on how the system parameters affect performance.
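The boundary-selection idea behind ESK-Means can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_boundaries`, the `max_span` limit, and the toy cost function are assumptions made for illustration. In an actual ESK-Means system the segment cost would typically be a distance between a segment embedding and its nearest cluster centroid, and segmentation would alternate with re-clustering.

```python
from typing import Callable, List, Sequence


def select_boundaries(
    candidates: Sequence[int],
    segment_cost: Callable[[int, int], float],
    max_span: int = 4,
) -> List[int]:
    """Choose the subset of candidate boundaries with minimum total cost.

    candidates: sorted positions, including the utterance start and end.
    segment_cost: hypothetical cost of a segment from `start` to `end`.
    max_span: maximum number of candidate intervals one segment may cover.
    """
    n = len(candidates)
    best = [float("inf")] * n  # best[j]: cheapest segmentation ending at j
    back = [0] * n             # back[j]: index of the previous boundary
    best[0] = 0.0
    for j in range(1, n):
        for i in range(max(0, j - max_span), j):
            cost = best[i] + segment_cost(candidates[i], candidates[j])
            if cost < best[j]:
                best[j] = cost
                back[j] = i
    # Trace back from the final boundary to recover the chosen subset.
    chosen = [candidates[n - 1]]
    j = n - 1
    while j > 0:
        j = back[j]
        chosen.append(candidates[j])
    return chosen[::-1]


def toy_cost(start: int, end: int) -> float:
    # Toy stand-in cost: prefers segments spanning exactly 3 units.
    return float((end - start - 3) ** 2)


initial = [0, 2, 3, 5, 6, 9]  # hypothetical initial boundary candidates
print(select_boundaries(initial, toy_cost))
```

Because the search only ever picks from `candidates`, a true word boundary that is missing from the initial set cannot be recovered, which is why the quality of the initial boundaries matters so much.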
