On the complementarity of short-time Fourier analysis windows of different lengths for improved language recognition

Previous work has shown that remarkable performance improvements can be attained in speaker and language recognition tasks by combining several heterogeneous systems that provide complementary information. This work studies the complementarity of several i-vector language recognition systems that use Mel-Frequency Cepstral Coefficient (MFCC) features computed on Short-Time Fourier Analysis windows of different sizes. Language recognition experiments carried out on the NIST 2007 and 2009 LRE datasets reveal relative performance gains of up to 33% when fusing the systems, compared with the best single system. The results suggest that combining acoustic systems based on analysis windows of different sizes may make it possible to take advantage of both the sharper characterization of short events provided by short windows and the better frequency resolution of stationary events provided by long windows.
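The trade-off between short and long analysis windows can be illustrated with a minimal sketch. It is not the authors' front end: it computes only Hamming-windowed magnitude spectra (no mel filterbank or cepstral transform), and the 8 kHz sampling rate, the 10/25/50 ms window lengths, the half-window hop, and the helper name `stft_magnitude` are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, win_len, hop):
    """Magnitude spectra of Hamming-windowed frames, one row per frame."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000                      # assumed sampling rate in Hz
t = np.arange(fs) / fs         # one second of samples

# Synthetic test signal: two stationary tones 60 Hz apart, plus a short click.
signal = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1060 * t)
signal[4000] += 20.0

# Hypothetical window lengths in milliseconds (not taken from the paper).
resolutions = {}
for win_ms in (10, 25, 50):
    win_len = int(fs * win_ms / 1000)
    spec = stft_magnitude(signal, win_len, hop=win_len // 2)
    bin_spacing_hz = fs / win_len   # spacing between DFT bins
    resolutions[win_ms] = (bin_spacing_hz, spec.shape)
    print(f"{win_ms} ms window: {bin_spacing_hz:.0f} Hz bin spacing, "
          f"{spec.shape[0]} frames x {spec.shape[1]} bins")
```

With these assumed settings, the 10 ms window yields many frames and so localizes the click tightly in time, but its 100 Hz bin spacing cannot separate the two tones. The 50 ms window has 20 Hz bin spacing, fine enough to separate them, at the cost of smearing the click over a longer frame.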
