Spoken language recognition based on gap-weighted subsequence kernels

Abstract: The phone recognizer followed by vector space model (PR-VSM) approach is a state-of-the-art phonotactic method for spoken language recognition. It relies on a bag-of-n-grams representation, in which each dimension of the supervector is derived from the count of one n-gram token. Because the gram order is fixed and small, n-grams cannot capture long-context co-occurrence relations. They are also vulnerable to errors introduced by the front-end phone recognizer. In this paper, we introduce a gap-weighted subsequence kernel (GWSK) method to overcome these drawbacks. GWSK counts token co-occurrences in a non-contiguous way, so it is both error-tolerant and capable of revealing long-context relations. We further propose a truncated GWSK that constrains the context length in order to remove interference from remote tokens and lower the computational cost, and we extend the idea to lattices to take advantage of multiple hypotheses from the phone recognizer. In addition, we investigate the optimal parameter setting and the computational complexity of the proposed methods. Experiments on the NIST 2009 LRE evaluation corpus with several configurations show that the proposed GWSK is consistently more effective than the PR-VSM approach.
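To make the idea of non-contiguous counting concrete, the following is a minimal Python sketch of the standard gap-weighted subsequence kernel from the string-kernel literature, not the authors' implementation. The function names, the decay parameter `lam`, and the optional `max_span` argument (a guess at what a context-length constraint could look like) are assumptions introduced here for illustration. Each length-p subsequence is weighted by `lam` raised to the span it covers, so subsequences stretched over many intervening phones contribute less.

```python
from collections import defaultdict
from itertools import combinations


def gwsk_features(tokens, p, lam, max_span=None):
    """Explicit gap-weighted feature map (brute force, for short sequences).

    Each length-p subsequence u gets weight lam ** span, summed over all
    index tuples that spell u, where span = last index - first index + 1.
    max_span is a hypothetical truncation: longer spans are skipped.
    """
    feats = defaultdict(float)
    for idx in combinations(range(len(tokens)), p):
        span = idx[-1] - idx[0] + 1
        if max_span is not None and span > max_span:
            continue
        feats[tuple(tokens[i] for i in idx)] += lam ** span
    return dict(feats)


def gwsk_kernel_bruteforce(s, t, p, lam, max_span=None):
    """Kernel value as the inner product of the two explicit feature maps."""
    fs = gwsk_features(s, p, lam, max_span)
    ft = gwsk_features(t, p, lam, max_span)
    return sum(v * ft.get(u, 0.0) for u, v in fs.items())


def gwsk_kernel(s, t, p, lam):
    """Untruncated kernel via dynamic programming, O(p * len(s) * len(t))."""
    n, m = len(s), len(t)
    if p < 1 or min(n, m) < p:
        return 0.0
    # dps[i][j]: weighted count of common subsequences of the current length
    # that end exactly at s[i-1] and t[j-1]; row 0 and column 0 are padding.
    dps = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                dps[i][j] = lam ** 2
    for _ in range(2, p + 1):
        # dp[i][j]: sum of dps[k][l] over k <= i, l <= j, decayed by the gaps.
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        new_dps = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i][j] = (dps[i][j]
                            + lam * dp[i - 1][j]
                            + lam * dp[i][j - 1]
                            - lam ** 2 * dp[i - 1][j - 1])
                if s[i - 1] == t[j - 1]:
                    new_dps[i][j] = lam ** 2 * dp[i - 1][j - 1]
        dps = new_dps
    return sum(sum(row) for row in dps)


# Hypothetical phone strings: `hyp` has one inserted phone ("r").
ref = ["k", "ae", "t"]
hyp = ["k", "ae", "r", "t"]
k_gapped = gwsk_kernel(ref, hyp, p=2, lam=0.5)
```

In this toy example the trigram `k ae t` never occurs contiguously in `hyp`, so a plain trigram count finds no match, whereas the gapped kernel still credits the shared subsequences `k ae`, `k t`, and `ae t` with decayed weights. The brute-force version enumerates all index tuples and is only practical for short sequences; the dynamic program computes the same value without building the feature map.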
