Language/Dialect Recognition Based on Unsupervised Deep Learning

Over the past decade, bottleneck features within an i-Vector framework have been used for state-of-the-art language/dialect identification (LID/DID). However, traditional bottleneck feature extraction requires additional transcribed speech data. To address this limitation, two types of unsupervised deep learning methods are introduced in this study. First, an unsupervised bottleneck feature extraction approach is proposed, which retains the traditional bottleneck structure but is trained with estimated phonetic labels. Second, two latent-variable learning algorithms based on generative autoencoders, previously considered for image processing/reconstruction, are introduced for speech feature processing. Specifically, a variational autoencoder and an adversarial autoencoder are applied at alternative stages of the speech processing pipeline. To demonstrate the effectiveness of the proposed methods, three corpora are evaluated: 1) a four-dialect Chinese corpus, 2) a five-dialect Arabic corpus, and 3) the Multi-Genre Broadcast challenge (MGB-3) corpus for Arabic DID. The proposed features consistently outperform traditional MFCC acoustic features across all three corpora. Taken collectively, the proposed features achieve up to a relative +58% improvement in $C_{\text{avg}}$ for LID/DID without the need for any secondary speech corpora.
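As a concrete illustration of the generative-autoencoder component, the sketch below shows a minimal frame-level variational autoencoder in PyTorch. The input dimensionality, layer sizes, latent size, and the use of the posterior mean as the extracted feature are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a frame-level variational autoencoder (VAE) for speech
# features, assuming a PyTorch implementation. Feature dimensionality and
# layer sizes below are illustrative assumptions; in this setting the
# posterior mean could serve as the unsupervised feature passed on to an
# i-Vector back-end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameVAE(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=512, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)       # posterior mean
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # posterior log-variance
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, feat_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld

# Example: one training step on a batch of acoustic frames (batch_size x feat_dim)
model = FrameVAE()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 39)            # placeholder batch standing in for MFCC frames
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
loss.backward()
optim.step()
```

The adversarial-autoencoder variant would replace the KL term with a discriminator that matches the encoder's latent distribution to the chosen prior; the encoder/decoder structure above would remain essentially the same.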
