Dimensionality Reduction for Using High-Order n-Grams in SVM-Based Phonotactic Language Recognition

SVM-based phonotactic language recognition is state-of-the-art technology. However, due to computational bounds, phonotactic information is usually limited to low-order phone n-grams (up to n = 3). In previous work, we proposed a feature selection algorithm, based on n-gram frequencies, which allowed us to work successfully with high-order n-grams on the NIST 2007 LRE database. In this work, we apply two feature projection methods for dimensionality reduction of feature spaces including up to 4-grams: Principal Component Analysis (PCA) and Random Projection. These methods allow us to attain competitive performance even for small feature sets (e.g. of size 500). Systems were built by means of open software (BUT phone decoders, HTK, SRILM, LIBLINEAR and FoCal) and experiments were carried out on the NIST 2009 LRE database. The best performance was attained by using the feature selection algorithm to get around 11500 features: 1.93% EER and CLLR = 0.413. When considering smaller sets of features, PCA provided the best performance. For instance, using PCA to get a 500-dimensional feature subspace yielded 2.15% EER and CLLR = 0.457 (a 25% improvement with regard to using feature selection).
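As a rough illustration of the two projection techniques named above, the sketch below applies a database-friendly sparse random projection (entries ±sqrt(3/k) with probability 1/6 each, 0 with probability 2/3) and a PCA projection via SVD to a toy matrix of n-gram count vectors. This is a minimal sketch assuming dense NumPy arrays and made-up dimensions; the actual systems operated on far larger, sparse 4-gram feature spaces.

```python
import numpy as np


def random_projection(X, k, seed=0):
    """Project rows of X (n_samples x d) into k dimensions using a
    sparse random matrix: entries are +sqrt(3/k) with prob. 1/6,
    0 with prob. 2/3, and -sqrt(3/k) with prob. 1/6."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    s = np.sqrt(3.0 / k)
    R = rng.choice([s, 0.0, -s], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return X @ R


def pca_projection(X, k):
    """Project rows of X onto the top-k principal directions,
    computed from the SVD of the mean-centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


# toy data: 10 utterances, 2000 hypothetical n-gram count features
X = np.random.default_rng(1).random((10, 2000))

Z_rp = random_projection(X, k=500)   # 500-dim random subspace
Z_pca = pca_projection(X, k=5)       # PCA rank bounded by n_samples - 1
print(Z_rp.shape, Z_pca.shape)
```

Note that with only 10 toy samples the PCA subspace is limited to at most 9 dimensions, whereas the random projection dimension k is free; in the experiments above, both methods reduced the 4-gram space to subspaces of a few hundred to a few thousand dimensions before SVM training.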
