EfficientTDNN: Efficient Architecture Search for Speaker Recognition

Convolutional neural networks (CNNs), such as the time-delay neural network (TDNN), have shown remarkable capability in learning speaker embeddings. However, they also incur a high computational cost in storage, processing, and memory. Finding a specialized CNN that meets a specific resource constraint requires substantial effort from human experts. Neural architecture search (NAS) offers a practical way to automate this manual design process and has attracted increasing interest in spoken language processing tasks such as speaker recognition. In this paper, we propose EfficientTDNN, an efficient architecture search framework consisting of a TDNN-based supernet and a TDNN-NAS algorithm. The proposed supernet extends TDNN with temporal convolutions covering different receptive-field ranges and with aggregation of features at various resolutions from different layers. On top of the supernet, the TDNN-NAS algorithm quickly searches for the desired TDNN architecture via weight-sharing subnets, which substantially reduces computation while accommodating a wide range of devices with varying resource requirements. Experimental results on the VoxCeleb dataset show that the proposed EfficientTDNN supports approximately 10^13 architectures that differ in depth, kernel size, and width. Under different computation constraints, it achieves a 2.20% equal error rate (EER) with 204M multiply-accumulate operations (MACs), 1.41% EER with 571M MACs, and 0.94% EER with 1.45G MACs. Comprehensive investigations suggest that the trained supernet generalizes to subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.
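To make the idea of searching subnets under a MAC budget concrete, the following is a minimal Python sketch, not the paper's TDNN-NAS algorithm. Every name and value in it is an assumption made for illustration: the search space (`KERNELS`, `WIDTHS`, `DEPTHS`), the 80-dimensional input, the 200-frame utterance, and the `score_fn` callback, which stands in for evaluating a subnet with weights inherited from a trained supernet. Plain random search is used here as a simple stand-in for the actual search strategy.

```python
import random

# Hypothetical search space; these values are illustrative, not the paper's.
KERNELS = [1, 3, 5]
WIDTHS = [128, 256, 384, 512]
DEPTHS = [2, 3, 4]


def conv1d_macs(c_in, c_out, kernel, frames):
    """MACs of a stride-1 1-D convolution that keeps the frame count."""
    return c_in * c_out * kernel * frames


def sample_subnet(rng):
    """Draw a depth, then one (kernel, width) pair per layer."""
    depth = rng.choice(DEPTHS)
    return [(rng.choice(KERNELS), rng.choice(WIDTHS)) for _ in range(depth)]


def subnet_macs(layers, in_dim=80, frames=200):
    """Sum the convolution MACs of a stack of layers (assumed input shape)."""
    total, c_in = 0, in_dim
    for kernel, width in layers:
        total += conv1d_macs(c_in, width, kernel, frames)
        c_in = width
    return total


def random_search(budget, score_fn, trials=1000, seed=0):
    """Return (score, layers) of the best sampled subnet within the MAC budget,
    or None if no sampled subnet fits."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        layers = sample_subnet(rng)
        if subnet_macs(layers) > budget:
            continue  # violates the computation constraint
        score = score_fn(layers)
        if best is None or score > best[0]:
            best = (score, layers)
    return best


def proxy_score(layers):
    """Placeholder score: total width. A real run would return, for example,
    the negative EER of the subnet evaluated with shared supernet weights."""
    return sum(width for _, width in layers)


result = random_search(budget=50_000_000, score_fn=proxy_score)
```

In a real setting, the budget would be matched to the target device and `score_fn` would measure validation accuracy, so that one trained supernet can serve several budgets without retraining.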
