Paper Information - Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection


With the explosive development of human-computer speech interaction, spoken term detection is widely required and has attracted increasing interest. In this paper, we propose a weakly supervised approach that uses a Siamese recurrent auto-encoder (RAE) to represent speech segments for query-by-example spoken term detection (QbyE-STD). The proposed approach trains the Siamese RAE on word pairs, where each pair contains two instances of either the same word or different words. The last hidden state vector of the Siamese RAE encoder is used as the feature for QbyE-STD; it is a fixed-dimensional embedding that mostly carries information related to semantic content. The advantages of the proposed approach are: 1) it extracts a more compact, fixed-dimensional feature while keeping the semantic information needed for STD; 2) the extracted feature describes, to some degree, the sequential phonetic structure of similar sounds, so it can be applied to zero-resource QbyE-STD. Evaluations on real-scene Chinese speech interaction data and TIMIT confirm the effectiveness and efficiency of the proposed approach compared with conventional ones.
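The idea of comparing a spoken query and a candidate segment through fixed-dimensional embeddings can be sketched as follows. This is a minimal illustration, not the authors' implementation: the plain tanh recurrent encoder, the random untrained weights, the cosine distance, and the contrastive-style `pair_loss` are all assumptions made for the example, and the decoder half of the auto-encoder is omitted. All function names are hypothetical.

```python
import math
import random

def make_encoder(input_dim, hidden_dim, seed=0):
    """Build random (untrained) weights for a plain tanh RNN encoder (assumed architecture)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
    return w_in, w_rec

def encode(frames, encoder):
    """Run the RNN over a variable-length frame sequence and return the last hidden state."""
    w_in, w_rec = encoder
    hidden = [0.0] * len(w_in)
    for frame in frames:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[i], frame))
                + sum(w * h for w, h in zip(w_rec[i], hidden))
            )
            for i in range(len(w_in))
        ]
    return hidden

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def pair_loss(dist, same_word, margin=0.5):
    """Contrastive-style loss (an assumption): pull same-word pairs together, push others apart."""
    return dist ** 2 if same_word else max(0.0, margin - dist) ** 2

# Siamese use: the SAME encoder weights embed both inputs.
encoder = make_encoder(input_dim=3, hidden_dim=4)
rng = random.Random(1)
query = [[rng.random() for _ in range(3)] for _ in range(5)]    # 5 frames of toy features
segment = [[rng.random() for _ in range(3)] for _ in range(9)]  # 9 frames of toy features
emb_q = encode(query, encoder)
emb_s = encode(segment, encoder)
dist = cosine_distance(emb_q, emb_s)
```

Both inputs map to vectors of the same length regardless of how many frames they contain, so detection reduces to thresholding a single distance per candidate segment rather than aligning two frame sequences.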

Lianhong Cai | Runnan Li | Zhiyong Wu | Helen M. Meng | Ziwei Zhu
