Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings

Recent literature presents evidence that both linguistic (phonemic) and non-linguistic (speaker identity, emotional content) information resides on a lower-dimensional manifold embedded within higher-dimensional spectral features such as MFCC and PLP. Linguistic or phonetic units of speech can be decomposed into a legal inventory of articulatory gestures shared across several phonemes according to their manner of articulation. We aim to discover a subspace that is rich in the gestural information of speech and captures the invariance of similar gestures. In this paper, we investigate which unsupervised techniques are best suited for learning such a subspace. The main contribution of the paper is an approach to learn a gesture-rich representation of speech automatically from data in a completely unsupervised manner. The study compares representations obtained with a convolutional autoencoder (ConvAE) against standard unsupervised dimensionality reduction techniques, namely Principal Component Analysis (PCA) and manifold learning, on the task of phoneme classification. The manifold learning techniques evaluated are Locally Linear Embedding (LLE), Isomap, and Laplacian Eigenmaps. Representations that best separate different gestures are suitable for discovering subword units under low- or zero-resource speech conditions. We further evaluate the representations using the Zero Resource Speech Challenge's ABX discriminability measure. Results indicate that the representations obtained with ConvAE and Isomap outperform baseline MFCC features both in phoneme classification and on the ABX measure, and induce separation between sounds composed of different sets of gestures. We then cluster the representations using a Dirichlet Process Gaussian Mixture Model (DPGMM) to learn the cluster distribution of the data automatically, and show that the resulting clusters correspond to groups with a similar manner of articulation. The DPGMM distribution is used as a prior to obtain correspondence terms for robust ConvAE training.
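A minimal sketch of the classical part of the pipeline described above, assuming scikit-learn and synthetic stand-in MFCC frames (the feature dimensions, neighbourhood sizes, and DPGMM truncation level are illustrative assumptions, not the paper's settings): MFCC frames are projected to a low-dimensional subspace with PCA, Isomap, LLE, or Laplacian Eigenmaps, and the resulting embedding is clustered with a Dirichlet Process GMM so that the number of active clusters is inferred from the data.

```python
# Sketch: unsupervised dimensionality reduction of MFCC frames followed by
# DPGMM clustering. Hyperparameters and the random input are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
mfcc_frames = rng.standard_normal((2000, 39))   # stand-in for real 39-dim MFCC(+delta) frames

reducers = {
    "pca": PCA(n_components=10),
    "isomap": Isomap(n_neighbors=12, n_components=10),
    "lle": LocallyLinearEmbedding(n_neighbors=12, n_components=10),
    "laplacian_eigenmaps": SpectralEmbedding(n_neighbors=12, n_components=10),
}
embeddings = {name: red.fit_transform(mfcc_frames) for name, red in reducers.items()}

# Dirichlet Process GMM (truncated stick-breaking): n_components is only an
# upper bound; unused components are pruned, so the effective number of
# clusters is learned from the data.
dpgmm = BayesianGaussianMixture(
    n_components=50,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
    random_state=0,
)
labels = dpgmm.fit_predict(embeddings["isomap"])
print("active clusters:", np.unique(labels).size)
```

The ConvAE branch can be sketched in the same spirit; the architecture below is an assumption (2-D convolutions over MFCC context windows with a small linear bottleneck), not the network reported in the paper, and is meant only to show where the learned representation would be read out.

```python
# Sketch of a convolutional autoencoder over MFCC context windows (assumed
# shape: 39 coefficients x 11 frames); the bottleneck activations are the
# learned low-dimensional representation.
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self, bottleneck_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # (1,39,11) -> (16,20,6)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # -> (32,10,3)
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 10 * 3, bottleneck_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 32 * 10 * 3),
            nn.ReLU(),
            nn.Unflatten(1, (32, 10, 3)),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # -> (16,20,6)
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1),                     # -> (1,39,11)
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = ConvAE()
patches = torch.randn(8, 1, 39, 11)             # placeholder batch of context windows
recon, codes = model(patches)
loss = nn.functional.mse_loss(recon, patches)   # plain reconstruction objective
```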
