Locally-connected and convolutional neural networks for small footprint speaker recognition

This work compares the performance of deep Locally-Connected Networks (LCN) and Convolutional Neural Networks (CNN) for text-dependent speaker recognition. These topologies model the local time-frequency correlations of the speech signal better, using only a fraction of the parameters of the fully connected Deep Neural Network (DNN) used in previous works. We show that both an LCN and a CNN can reduce the total model footprint to 30% of the size of the baseline fully-connected DNN, with minimal impact on performance or latency. In addition, when parameter counts are matched, the LCN improves speaker verification performance, as measured by equal error rate (EER), by 8% relative over the baseline without increasing model size or computation. Similarly, a CNN improves EER by 10% relative over the baseline at the same model size, but with increased computation.
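To make the parameter-savings argument concrete, the following sketch compares the parameter counts of a fully-connected layer, a locally-connected layer, and a convolutional layer applied to the same time-frequency input. The input and kernel dimensions below are illustrative assumptions, not the configuration reported in the paper; the point is only that local connectivity shrinks the count by roughly the ratio of the receptive-field area to the full input, and weight sharing in the convolution shrinks it further at the cost of reusing the same kernels everywhere.

# Illustrative parameter-count comparison of fully-connected (DNN),
# locally-connected (LCN) and convolutional (CNN) layers over a
# time-frequency patch. All shapes are assumed for illustration only.

def fully_connected_params(in_h, in_w, in_c, out_units):
    # Every input unit connects to every output unit, plus one bias per output.
    return in_h * in_w * in_c * out_units + out_units

def conv_params(k_h, k_w, in_c, out_c):
    # One shared kernel and bias per output channel, independent of input size.
    return k_h * k_w * in_c * out_c + out_c

def locally_connected_params(in_h, in_w, k_h, k_w, in_c, out_c):
    # Same local receptive fields as a convolution, but a separate (unshared)
    # kernel and bias at every output position ("valid" padding, stride 1).
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    return out_h * out_w * (k_h * k_w * in_c * out_c + out_c)

if __name__ == "__main__":
    in_h, in_w, in_c = 40, 80, 1       # e.g. 40 filterbanks x 80 frames (assumed)
    k_h, k_w, out_c = 5, 5, 8          # 5x5 local patches, 8 feature maps (assumed)
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    out_units = out_h * out_w * out_c  # match output dimensionality across layers

    print("fully connected  :", fully_connected_params(in_h, in_w, in_c, out_units))
    print("locally connected:", locally_connected_params(in_h, in_w, k_h, k_w, in_c, out_c))
    print("convolutional    :", conv_params(k_h, k_w, in_c, out_c))

Under these assumed shapes, the locally-connected layer uses k_h*k_w / (in_h*in_w) as many weights as the fully-connected layer for the same output dimensionality, and the convolutional layer uses far fewer still, which is consistent with the footprint reduction and the size/computation trade-off described above.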
