DSARSR: Deep Stacked Auto-encoders Enhanced Robust Speaker Recognition

Speaker recognition is a biometric modality that uses a speaker's speech segments to recognize identity, determining whether the test speaker is one of the enrolled speakers. To improve the robustness of the i-vector framework under cross-channel conditions and to explore a novel way of applying deep learning to speaker recognition, stacked auto-encoders are used to extract an abstract representation of the i-vector instead of applying probabilistic linear discriminant analysis (PLDA). After pre-processing and feature extraction, speaker- and channel-independent speech is used to train the universal background model (UBM). The UBM is then used to extract the i-vectors of the enrollment and test speech. Unlike the traditional i-vector framework, which uses linear discriminant analysis (LDA) to reduce dimensionality and increase the discrimination between speaker subspaces, this research uses stacked auto-encoders to reconstruct the i-vector at a lower dimension, after which different classifiers can be chosen for the final classification. The experimental results show that the proposed method achieves better performance than the state-of-the-art method.
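The dimensionality-reduction step described above can be illustrated with a minimal sketch of greedy layer-wise training of stacked auto-encoders. This is not the paper's implementation: the random 8-dimensional vectors merely stand in for real i-vectors, and the layer sizes (8 to 6 to 3), the tanh encoder with a linear decoder, the plain SGD training, and all function names are assumptions chosen for illustration.

```python
import math
import random


def encode(x, W, b):
    """Map an input vector to its hidden code: h = tanh(W x + b)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def train_layer(data, hidden_dim, epochs=200, lr=0.05, seed=0):
    """Train one auto-encoder layer (tanh encoder, linear decoder) with SGD.

    Returns the encoder parameters and the mean reconstruction loss
    measured before the first epoch and after the last one.
    """
    rng = random.Random(seed)
    in_dim = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b = [0.0] * hidden_dim
    V = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(in_dim)]
    c = [0.0] * in_dim

    def mean_loss():
        total = 0.0
        for x in data:
            h = encode(x, W, b)
            for i in range(in_dim):
                x_hat = sum(V[i][j] * h[j] for j in range(hidden_dim)) + c[i]
                total += 0.5 * (x_hat - x[i]) ** 2
        return total / len(data)

    loss_before = mean_loss()
    for _ in range(epochs):
        for x in data:
            h = encode(x, W, b)
            # Reconstruction error of the linear decoder.
            err = [sum(V[i][j] * h[j] for j in range(hidden_dim)) + c[i] - x[i]
                   for i in range(in_dim)]
            # Back-propagate through the decoder and the tanh non-linearity
            # (computed before V is updated).
            dz = [sum(V[i][j] * err[i] for i in range(in_dim)) * (1.0 - h[j] ** 2)
                  for j in range(hidden_dim)]
            for i in range(in_dim):
                c[i] -= lr * err[i]
                for j in range(hidden_dim):
                    V[i][j] -= lr * err[i] * h[j]
            for j in range(hidden_dim):
                b[j] -= lr * dz[j]
                for k in range(in_dim):
                    W[j][k] -= lr * dz[j] * x[k]
    return (W, b), (loss_before, mean_loss())


def train_stack(data, layer_dims, **kwargs):
    """Greedy layer-wise training: each layer learns from the previous layer's codes."""
    layers, losses, codes = [], [], data
    for dim in layer_dims:
        (W, b), loss = train_layer(codes, dim, **kwargs)
        layers.append((W, b))
        losses.append(loss)
        codes = [encode(x, W, b) for x in codes]
    return layers, losses


def compress(x, layers):
    """Pass a vector through every trained encoder to get its low-dimensional code."""
    for W, b in layers:
        x = encode(x, W, b)
    return x


# Synthetic stand-in for i-vectors: 3 hypothetical speakers, 10 noisy vectors each.
rng = random.Random(42)
centroids = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
ivectors = [[m + rng.gauss(0, 0.1) for m in centroids[s]]
            for s in range(3) for _ in range(10)]

layers, losses = train_stack(ivectors, [6, 3])
low_dim = [compress(x, layers) for x in ivectors]
```

The resulting `low_dim` codes would then be passed to whichever back-end classifier is chosen, which is the role LDA followed by a scoring stage plays in the traditional i-vector pipeline.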
