An empirical evaluation of zero resource acoustic unit discovery

Acoustic unit discovery (AUD) is a process of automatically identifying a categorical acoustic unit inventory from speech and producing corresponding acoustic unit tokenizations. AUD provides an important avenue for unsupervised acoustic model training in a zero resource setting where expert-provided linguistic knowledge and transcribed speech are unavailable. Therefore, to further facilitate zero-resource AUD process, in this paper, we demonstrate acoustic feature representations can be significantly improved by (i) performing linear discriminant analysis (LDA) in an unsupervised self-trained fashion, and (ii) leveraging resources of other languages through building a multilingual bottleneck (BN) feature extractor to give effective cross-lingual generalization. Moreover, we perform comprehensive evaluations of AUD efficacy on multiple downstream speech applications, and their correlated performance suggests that AUD evaluations are feasible using different alternative language resources when only a subset of these evaluation resources can be available in typical zero resource applications.

[1]  Breck Baldwin,et al.  Entity-Based Cross-Document Coreferencing Using the Vector Space Model , 1998, COLING.

[2]  Herbert Gish,et al.  Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery , 2014, Comput. Speech Lang..

[3]  Dong Yu,et al.  Improved Bottleneck Features Using Pretrained Deep Neural Networks , 2011, INTERSPEECH.

[4]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[5]  John J. Godfrey,et al.  SWITCHBOARD: telephone speech corpus for research and development , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Timothy J. Hazen MCE Training Techniques for Topic Identification of Spoken Audio Documents , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Aren Jansen,et al.  An evaluation of graph clustering methods for unsupervised term discovery , 2015, INTERSPEECH.

[8]  James R. Glass,et al.  A Nonparametric Bayesian Approach to Acoustic Model Discovery , 2012, ACL.

[9]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[10]  Julio Gonzalo,et al.  A comparison of extrinsic clustering evaluation metrics based on formal constraints , 2008, Information Retrieval.

[11]  Lukás Burget,et al.  Variational Inference for Acoustic Unit Discovery , 2016, Workshop on Spoken Language Technologies for Under-resourced Languages.

[12]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[13]  Aren Jansen,et al.  NLP on Spoken Documents Without ASR , 2010, EMNLP.

[14]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[15]  Aren Jansen,et al.  Efficient spoken term discovery using randomized algorithms , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[16]  Michael I. Jordan,et al.  Variational inference for Dirichlet process mixtures , 2006 .

[17]  Aren Jansen,et al.  Rapid Evaluation of Speech Representations for Spoken Term Discovery , 2011, INTERSPEECH.

[18]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[19]  Tanja Schultz,et al.  Automatic speech recognition for under-resourced languages: A survey , 2014, Speech Commun..

[20]  James R. Glass,et al.  Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[21]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[22]  Bin Ma,et al.  Parallel inference of dirichlet process Gaussian mixture models for unsupervised acoustic modeling: a feasibility study , 2015, INTERSPEECH.

[23]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine-mediated learning.

[24]  Martin Karafiát,et al.  The language-independent bottleneck features , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[25]  Kenneth Ward Church,et al.  A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[26]  Satoshi Nakamura,et al.  Unsupervised Linear Discriminant Analysis for Supporting DPGMM Clustering in the Zero Resource Scenario , 2016, SLTU.