Paper information - Application of Convolutional Neural Networks to Language Identification in Noisy Conditions

This paper proposes two novel frontends for robust language identification (LID) using a convolutional neural network (CNN) trained for automatic speech recognition (ASR). In the CNN/i-vector frontend, the CNN, instead of a universal background model (UBM), is used to obtain the posterior probabilities for i-vector training and extraction. The CNN/posterior frontend is somewhat similar to a phonetic system in that the occupation counts of (tied) triphone states (senones) given by the CNN are used for classification. These counts are compressed to a low-dimensional vector using probabilistic principal component analysis (PPCA). Evaluated on heavily degraded speech data, the proposed frontends improve average equal error rate by up to 50% compared to a UBM/i-vector baseline. Moreover, the proposed frontends are complementary and, when combined, give significant gains of up to 20% relative to the best single system.
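As a rough illustration of the quantities the abstract mentions, the sketch below accumulates senone occupation counts and posterior-weighted first-order statistics from per-frame posteriors. It is a minimal sketch, not the paper's implementation: the function names, the array shapes, the random stand-in for CNN outputs, and the count normalization are all assumptions, and the i-vector extractor and PPCA compression are not shown.

```python
import numpy as np


def sufficient_stats(posteriors, features):
    """Accumulate per-senone statistics for one utterance.

    posteriors: (T, K) array, one posterior distribution over K senones per frame
                (in the paper these would come from the CNN; here they are hypothetical).
    features:   (T, D) array of acoustic feature vectors, one per frame.
    """
    zeroth = posteriors.sum(axis=0)   # (K,) occupation count of each senone
    first = posteriors.T @ features   # (K, D) posterior-weighted feature sums
    return zeroth, first


def normalized_counts(zeroth):
    """Scale occupation counts to sum to 1 (an assumed normalization)."""
    return zeroth / zeroth.sum()


# Toy data standing in for CNN outputs and acoustic features.
rng = np.random.default_rng(0)
T, K, D = 200, 8, 5
logits = rng.normal(size=(T, K))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
features = rng.normal(size=(T, D))

zeroth, first = sufficient_stats(posteriors, features)
counts = normalized_counts(zeroth)
```

In this sketch, `zeroth` and `first` correspond to the kind of statistics an i-vector extractor consumes, while `counts` is the kind of per-utterance vector that a dimensionality reduction step such as PPCA would then compress.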

[1] Yun Lei, et al. Improving language identification robustness to highly channel-degraded speech through multiple system fusion, 2013, INTERSPEECH.

[2] George Saon, et al. Neural network acoustic models for the DARPA RATS program, 2013, INTERSPEECH.

[3] Mireia Díez, et al. Study of Different Backends in a State-Of-the-Art Language Recognition System, 2012, INTERSPEECH.

[4] Yun Lei, et al. Adaptive Gaussian backend for robust language identification, 2013, INTERSPEECH.

[5] Kevin Walker, et al. The RATS radio traffic collection system, 2012, Odyssey.

[6] Patrick Kenny, et al. Front-End Factor Analysis for Speaker Verification, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[7] Tara N. Sainath, et al. Deep convolutional neural networks for LVCSR, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8] S. J. Young, et al. Tree-based state tying for high accuracy acoustic modelling, 1994.

[9] Marc A. Zissman, et al. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech, 2004.

[10] Yun Lei, et al. A novel scheme for speaker recognition using a phonetically-aware deep neural network, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Jan Cernocký, et al. Phonotactic Language Recognition using i-vectors and Phoneme Posteriogram Counts, 2012, INTERSPEECH.

[12] Pavel Matejka, et al. Description and analysis of the Brno276 system for LRE2011, 2012, Odyssey.

[13] Dong Yu, et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[14] 尚弘島影. Superconductivity research and life at the National Institute of Standards and Technology, 2001.

[15] Yun Lei, et al. Effective use of DCTS for contextualizing features for speaker recognition, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Gerald Penn, et al. Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition, 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Yun Lei, et al. Factor Analysis Back Ends for MLLR Transforms in Speaker Recognition, 2011, INTERSPEECH.

[18] Pavel Matejka, et al. Phonotactic language identification using high quality phoneme recognition, 2005, INTERSPEECH.

[19] James H. Elder, et al. Probabilistic Linear Discriminant Analysis for Inferences About Identity, 2007, 2007 IEEE 11th International Conference on Computer Vision.

[20] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[21] William M. Campbell, et al. Experiments with Lattice-based PPRLM Language Identification, 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[22] Dong Yu, et al. Exploring convolutional neural network structures and optimization techniques for speech recognition, 2013, INTERSPEECH.

[23] Yoshua Bengio, et al. Convolutional networks for images, speech, and time series, 1998.

[24] Tara N. Sainath, et al. Fundamental Technologies in Modern Speech Recognition, 2012. Digital Object Identifier 10.1109/MSP.2012.2205597.