Paper information - The DKU System Description for The Interspeech 2021 Auto-KWS Challenge

This paper introduces the system submitted by the DKU-SMIIP team for the Auto-KWS 2021 Challenge. Our implementation consists of a speaker verification system and a two-stage keyword spotting (KWS) system based on query-by-example spoken term detection. The two KWS stages use different detection algorithms. The first stage adopts subsequence dynamic time warping for template matching based on frame-level language-independent bottleneck features and phoneme posterior probabilities. The second stage uses a sliding-window template matching algorithm based on acoustic word embeddings to further verify the detections from the first stage. As a result, our KWS system achieves an average score of 0.61 on the feedback dataset, which outperforms the baseline system by 0.25.
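The first-stage idea named in the abstract, subsequence dynamic time warping, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_distance` and `subsequence_dtw`, the use of cosine distance, the step pattern, and the normalization by query length are all assumptions made for illustration. The feature vectors stand in for whatever frame-level features are used (the abstract mentions bottleneck features and phoneme posteriors).

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def subsequence_dtw(query, utterance):
    """Find the best match of `query` anywhere inside `utterance`.

    Both arguments are lists of frame-level feature vectors.
    Returns (normalized_cost, start_frame, end_frame).
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # cost[i][j]: accumulated cost of the best path that aligns
    # query[0..i] and ends at utterance frame j.
    cost = [[inf] * m for _ in range(n)]
    # start[i][j]: utterance frame where that best path begins.
    start = [[0] * m for _ in range(n)]

    # Free start: the first query frame may align to any utterance frame.
    for j in range(m):
        cost[0][j] = cosine_distance(query[0], utterance[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            d = cosine_distance(query[i], utterance[j])
            # Assumed step pattern: vertical, horizontal, diagonal.
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i][j - 1], start[i][j - 1]))
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
            best_cost, best_start = min(candidates)
            cost[i][j] = best_cost + d
            start[i][j] = best_start

    # Free end: the match may finish at any utterance frame.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    # Simplified normalization by query length (an assumption).
    return cost[n - 1][end] / n, start[n - 1][end], end


if __name__ == "__main__":
    toy_query = [[1.0, 0.0], [0.0, 1.0]]
    toy_utterance = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(subsequence_dtw(toy_query, toy_utterance))
```

In a two-stage setup like the one described, a low normalized cost would mark the span between the returned start and end frames as a candidate detection for the second stage to verify. The threshold for "low" is a tuning choice that the abstract does not specify.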

Zexin Cai | Ming Li | Yechen Wang | Yan Jia | Murong Ma
