LID-Senones and Their Statistics for Language Identification

Recent research on end-to-end training structures for language identification has raised the possibility that intermediate language-sensitive feature units exist which are analogous to phonetically sensitive senones in automatic speech recognition systems. Termed language identification (LID)-senones, the statistics derived from these feature units have been shown to be beneficial in discriminating between languages, particularly for short utterances. This paper examines the evidence for the existence of LID-senones before designing and evaluating LID systems based on low- and high-level statistics of LID-senones with both generative and discriminative models. For the standard NIST LRE 2009 task on 23 languages, LID-senone-based systems are shown to outperform state-of-the-art deep neural network/i-vector methods both when LID-senones are used directly for classification and when LID-senone statistics are used for i-vector formation.
