Robust Language Recognition Based on Diverse Features

In real-world scenarios, robust language identification (LID) is hindered by factors such as background noise, channel mismatch, and speech-duration mismatch. To address these issues, this study examines recent advances in diverse acoustic features and back-ends, and their influence on LID system fusion. Little prior research has addressed how to select complementary features for multi-system fusion in LID. A set of distinct features is considered, grouped into three categories: classical features, innovative features, and extensional features. Both front-end concatenation and back-end fusion are considered. The results suggest that no single feature type is universally vital across all LID tasks, and that fusing a diverse set of features is needed to sustain LID performance in challenging scenarios. Back-end fusion also consistently and significantly improves system performance. Specifically, the proposed hybrid fusion method improves system performance by 38.5% and 46.2% on the DARPA RATS and NIST LRE09 data sets, respectively.
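The two fusion levels named above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy feature streams, and the hand-set equal fusion weights are all assumptions made for illustration, and the abstract does not state how the actual weights are obtained.

```python
from typing import List, Sequence


def concatenate_features(streams: Sequence[List[List[float]]]) -> List[List[float]]:
    """Front-end concatenation: join the per-frame vectors of several feature streams.

    Each stream is a list of frames; each frame is a list of coefficients.
    All streams must be frame-aligned (same number of frames).
    """
    n_frames = len(streams[0])
    if any(len(s) != n_frames for s in streams):
        raise ValueError("all streams must have the same number of frames")
    return [[v for s in streams for v in s[t]] for t in range(n_frames)]


def fuse_scores(system_scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Back-end fusion: weighted sum of per-language scores from several subsystems."""
    if len(system_scores) != len(weights):
        raise ValueError("one weight is required per subsystem")
    n_langs = len(system_scores[0])
    if any(len(s) != n_langs for s in system_scores):
        raise ValueError("all subsystems must score the same set of languages")
    return [sum(w * s[k] for w, s in zip(weights, system_scores))
            for k in range(n_langs)]


# Hypothetical toy data: two frame-aligned feature streams with 2 frames each.
stream_a = [[1.0, 2.0], [3.0, 4.0]]   # 2 coefficients per frame
stream_b = [[0.5], [0.25]]            # 1 coefficient per frame
joint = concatenate_features([stream_a, stream_b])

# Hypothetical per-language scores from two subsystems (3 candidate languages).
scores_a = [2.0, 0.0, -2.0]
scores_b = [0.0, 4.0, -4.0]
fused = fuse_scores([scores_a, scores_b], weights=[0.5, 0.5])  # assumed equal weights

# Index of the language with the highest fused score.
best = max(range(len(fused)), key=fused.__getitem__)
```

Here `joint` holds one 3-dimensional vector per frame, which a single back-end would consume, whereas `fused` combines the outputs of separately trained subsystems. A hybrid scheme would apply both steps.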
