Subspace-Based Representation and Learning for Phonotactic Spoken Language Recognition

Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution of phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework consists of two successive parts. The first part is subspace construction: each utterance is decoded into a sequence of phone-posterior vectors, and the vector sequence is transformed into a linear orthogonal subspace by low-rank matrix factorization or dynamic linear modeling. The second part is subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of an SNN is specifically designed for samples represented as subspaces. Its topology modifies the conventional feed-forward pass to fit the mathematical definition of subspace similarity, which ensures that identical subspaces yield the same output. Evaluated on the “General LR” test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rate over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods.
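As an illustration of the two parts described above, the following is a minimal sketch, not the authors' implementation. It assumes a truncated SVD as the low-rank factorization and the projection kernel (the sum of squared cosines of the principal angles) as the subspace similarity; the function names, the rank of 5, and the random Dirichlet "posteriors" are all hypothetical choices made for the example.

```python
import numpy as np


def subspace_basis(posteriors, rank):
    """Return a (D, rank) orthonormal basis for an utterance.

    posteriors: (T, D) array, one phone-posterior vector per frame.
    Assumption: a truncated SVD stands in for the low-rank factorization.
    """
    frames_as_columns = posteriors.T  # shape (D, T)
    left_vectors, _, _ = np.linalg.svd(frames_as_columns, full_matrices=False)
    return left_vectors[:, :rank]


def projection_similarity(basis_a, basis_b):
    """Projection kernel: squared Frobenius norm of basis_a^T basis_b.

    It depends only on the spans, not on the particular bases chosen.
    """
    return np.linalg.norm(basis_a.T @ basis_b, "fro") ** 2


rng = np.random.default_rng(0)
rank, num_phones = 5, 40

# Synthetic stand-ins for phone-posterior sequences (rows sum to 1).
utt_a = rng.dirichlet(np.ones(num_phones), size=200)
utt_b = rng.dirichlet(np.ones(num_phones), size=150)

basis_a = subspace_basis(utt_a, rank)
basis_b = subspace_basis(utt_b, rank)

# A different orthonormal basis spanning the same subspace as basis_a.
rotation, _ = np.linalg.qr(rng.standard_normal((rank, rank)))
basis_a_rotated = basis_a @ rotation

sim_same = projection_similarity(basis_a, basis_a_rotated)
sim_diff = projection_similarity(basis_a, basis_b)
print(f"same subspace: {sim_same:.4f}, different utterances: {sim_diff:.4f}")
```

Because the similarity is computed from the spans rather than from the basis vectors themselves, rotating the basis of one utterance leaves the score at its maximum (the rank). This is the invariance that any learner operating on subspaces, kernel machine or neural network, would need to preserve.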
