Combined Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients Using Autoencoder for Speaker Recognition

Recently, neural network technology has shown remarkable progress in speech recognition, including word classification, emotion recognition, and identity recognition. This paper introduces three novel speaker recognition methods to improve accuracy. The first method, long short-term memory with mel-frequency cepstral coefficients for triplet loss (LSTM-MFCC-TL), uses MFCC as input features for an LSTM model and is trained with triplet loss and cluster training. The second method, bidirectional long short-term memory with mel-frequency cepstral coefficients for triplet loss (BLSTM-MFCC-TL), improves speaker recognition accuracy by replacing the LSTM with a bidirectional LSTM model. The third method, bidirectional long short-term memory with mel-frequency cepstral coefficients and autoencoder features for triplet loss (BLSTM-MFCCAE-TL), uses an autoencoder to extract additional AE features, which are concatenated with the MFCC features and fed into the BLSTM model. The results showed that the BLSTM model outperformed the LSTM model and that adding AE features achieved the best learning performance. Moreover, the proposed methods compute faster than the reference GMM-HMM model. Therefore, using a pre-trained autoencoder to encode speakers and obtain AE features can significantly enhance speaker recognition performance while offering faster computation than traditional methods.
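To illustrate the triplet-loss objective shared by all three methods, here is a minimal NumPy sketch: it pulls an anchor embedding toward a same-speaker (positive) embedding and pushes it away from a different-speaker (negative) embedding by at least a margin. The margin value, embedding dimension, and example vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over Euclidean distances between speaker embeddings.

    Loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`; otherwise it grows linearly.
    """
    d_ap = np.linalg.norm(anchor - positive)  # same-speaker distance
    d_an = np.linalg.norm(anchor - negative)  # different-speaker distance
    return max(d_ap - d_an + margin, 0.0)

# Toy 4-dimensional embeddings (hypothetical, for illustration only).
anchor   = np.array([1.0, 0.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0, 0.0])  # same speaker, close to anchor
negative = np.array([0.0, 1.0, 0.0, 0.0])  # different speaker, far away

loss = triplet_loss(anchor, positive, negative)
```

With these toy vectors the positive already sits much closer to the anchor than the negative, so the loss is zero; swapping the positive and negative roles yields a positive loss, which is the gradient signal that drives the embeddings apart during training.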
