Using cross-decoder co-occurrences of phone n-grams in SVM-based phonotactic language recognition

Most common approaches to phonotactic language recognition rely on several independent phone decoders. Decodings are processed and scored in a fully uncoupled way, so their time alignment (and the information that may be extracted from it) is completely lost. We recently presented a new approach to phonotactic language recognition that takes time alignment information into account by considering cross-decoder co-occurrences of phones or phone n-grams at the frame level. Experiments on the NIST LRE2007 database demonstrated that using co-occurrence statistics could improve the performance of baseline phonotactic recognizers. In this work, the approach based on cross-decoder co-occurrences of phone n-grams is further developed and evaluated. Systems were built with open software (Brno University of Technology phone decoders, LIBLINEAR and FoCal), and experiments were carried out on the NIST LRE2007 database. A system based on co-occurrences of phone n-grams (up to 4-grams) outperformed the baseline phonotactic system, yielding around 8% relative improvement in terms of equal error rate (EER). The best fused system attained 1.90% EER (a 16% improvement with regard to the baseline system), which supports the use of cross-decoder dependencies for improved language modeling.
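
To make the idea of frame-level cross-decoder co-occurrences concrete, the following is a minimal sketch and not the authors' implementation. It assumes each decoder outputs a list of (phone, start_frame, end_frame) segments with an exclusive end frame, and it counts only single-phone pairs; the abstract does not specify the exact counting scheme, the n-gram extension, or the feature scaling, so the function names and the data layout here are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical segment layout: (phone label, start frame, end frame exclusive).
Segment = Tuple[str, int, int]


def expand_to_frames(segments: List[Segment]) -> List[str]:
    """Turn a segment-level decoding into one phone label per frame."""
    frames: List[str] = []
    for phone, start, end in segments:
        frames.extend([phone] * (end - start))
    return frames


def cooccurrence_counts(frames_a: List[str], frames_b: List[str]) -> Counter:
    """Count how often each (phone_a, phone_b) pair is active in the same frame.

    zip() stops at the shorter sequence, so only frames covered by both
    decodings are counted.
    """
    return Counter(zip(frames_a, frames_b))


def relative_frequencies(counts: Counter) -> Dict[Tuple[str, str], float]:
    """Normalize counts to relative frequencies (one possible feature vector)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in counts.items()}


# Toy decodings of the same 5-frame utterance by two decoders.
decoder_a = [("a", 0, 3), ("b", 3, 5)]
decoder_b = [("x", 0, 2), ("y", 2, 5)]

counts = cooccurrence_counts(expand_to_frames(decoder_a),
                             expand_to_frames(decoder_b))
features = relative_frequencies(counts)
```

In a setup like the one the abstract describes, sparse vectors of this kind (one dimension per cross-decoder pair) could serve as input to a linear SVM; how they are weighted and combined with the standard n-gram counts is left open here.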
