Language Recognition Using Latent Dynamic Conditional Random Field Model with Phonological Features

Spoken language recognition (SLR), which identifies the language of a speech utterance, has attracted growing interest in multilingual speech recognition. Most existing SLR approaches apply statistical modeling techniques to acoustic and phonotactic features. Among the popular approaches, the acoustic approach has drawn more interest than the others because it does not require any prior language-specific knowledge. Previous research on the acoustic approach has paid little attention to linguistic knowledge, using it only as supplementary features, while the current state-of-the-art system assumes independence among features. This paper proposes an SLR system based on the latent-dynamic conditional random field (LDCRF) model using phonological features (PFs). We use PFs to represent both acoustic characteristics and linguistic knowledge, and we employ the LDCRF model to capture the dynamics of the PF sequences for language classification. To evaluate the features and methods, we built several baseline systems: Gaussian mixture model (GMM) based systems using PFs, a GMM using cepstral features, and a CRF model using PFs. Evaluated on the NIST LRE 2007 corpus, the proposed method improved on the baseline systems and achieved results comparable to those of the i-vector-based acoustic system. This research demonstrates that utilizing PFs can enhance SLR performance.
