Hypersphere Embedding and Additive Margin for Query-by-example Keyword Spotting

Query-by-example (QbE) keyword spotting lets users define their own keywords, which makes it useful for device control. However, the regular softmax loss commonly used to train QbE models has two limitations. First, the learned features are not discriminative enough. Second, variations in the norm of the unnormalized features affect the computation of cosine similarities. To address these issues, this paper introduces normalization and an additive margin into residual networks for QbE keyword spotting. Features and weights are normalized onto a hypersphere of fixed radius, and the additive margin further reduces intra-class variation and increases inter-class differences. Based on the public datasets AISHELL-1 and HelloNPU, we design three test sets, namely in-vocabulary, out-of-vocabulary, and cross-corpus, to evaluate the proposed method. Experiments show that the proposed method learns more discriminative embedding features. In the fully unseen condition, it achieves a relative false rejection rate reduction of 46.60% at a false alarm rate of 2% in the cross-corpus evaluation, compared with regular softmax.
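
The normalization and additive margin described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the scale of 30 (standing in for the hypersphere radius), and the margin of 0.35 are illustrative assumptions, and the residual network that would produce the features is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def am_softmax_loss(features, weights, labels, scale=30.0, margin=0.35):
    """Additive-margin softmax on L2-normalized features and weights.

    features: (batch, dim) embeddings, weights: (classes, dim),
    labels: (batch,) integer class ids.
    scale and margin are assumed values, not taken from the paper.
    """
    f = l2_normalize(features)
    w = l2_normalize(weights)
    cos = f @ w.T                              # cosine of angle to each class
    rows = np.arange(len(labels))
    logits = scale * cos
    logits[rows, labels] -= scale * margin     # margin on the target class only
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

def cosine_score(query_emb, test_emb):
    """Cosine similarity between an enrolled query and a test embedding."""
    return float(l2_normalize(query_emb) @ l2_normalize(test_emb))

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
W = rng.normal(size=(5, 16))
y = rng.integers(0, 5, size=8)
loss = am_softmax_loss(feats, W, y)
```

Because both features and weights are normalized, rescaling an embedding leaves the loss and the cosine score unchanged, which removes the norm variation mentioned above. The margin lowers only the target-class logit, so a sample must be closer in angle to its own class before the loss becomes small.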
