A Robust Equalization Feature for Language Recognition

Wenjie Song, Chen Chen, Tianyang Sun, Wei Wang*
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
64228128@qq.com, 794452272@qq.com, sun_tianyang@yahoo.com, wangwei_hitwh@126.com

Abstract: The performance of a language recognition system is determined mainly by feature extraction and model training. In this paper, a robust equalization feature for language recognition is proposed: the common information in the speech spectrum mean vectors is used to compute a global mean vector, and the spectrum mean vector of each segment is equalized against this global mean to obtain the equalization features. For model training, the Gated Recurrent Unit (GRU), a variant of the Recurrent Neural Network (RNN), is applied to language recognition; the GRU reduces the amount of computation and shortens training time. Experimental results show that the proposed method outperforms the baseline system on the NIST LRE 2007 corpus.
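The abstract's equalization step can be sketched as follows. Since the exact formula is not given here, this is a minimal NumPy illustration under the assumption that "equalizing on the global mean" means shifting each segment's frames so that the segment's spectrum mean vector coincides with the global mean vector averaged over all segments (the function name `equalize_features` is hypothetical):

```python
import numpy as np

def equalize_features(segments):
    """Hypothetical equalization-feature sketch: shift each segment's
    spectral frames so the segment mean matches the global mean.
    `segments` is a list of (frames x dims) spectral feature arrays."""
    # Per-segment spectrum mean vectors (one mean vector per segment).
    seg_means = [seg.mean(axis=0) for seg in segments]
    # Global mean vector shared across all segments.
    global_mean = np.mean(seg_means, axis=0)
    # Equalize: remove the segment's own mean, restore the global mean.
    return [seg - m + global_mean for seg, m in zip(segments, seg_means)]
```

After this transform every segment has the same mean vector, which is one plausible reading of removing per-segment channel bias while keeping the language-bearing spectral shape.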
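On the modeling side, the abstract's claim that a GRU reduces computation relative to an LSTM comes from its structure: two gates and no separate cell state. The paper's network configuration is not given here, so the following is only a generic single-step GRU in the standard formulation of Cho et al., not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step (standard formulation, illustrative only).
    x: input vector, h: previous hidden state,
    params: (Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh) weight matrices/biases."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(x @ Wz + h @ Uz + bz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)  # candidate state
    return (1.0 - z) * h + z * h_tilde             # interpolated new state
```

With only the update and reset gates, a GRU needs three weight blocks per layer versus four for an LSTM, which is the source of the shorter training time mentioned in the abstract.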
