Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection

With the explosive development of human-computer speech interaction, spoken term detection is widely required and has attracted increasing interest. In this paper, we propose a weak supervised approach using Siamese recurrent auto-encoder (RAE) to represent speech segments for query-by-example spoken term detection (QbyE-STD). The proposed approach exploits word pairs that contain different instances of the same/different word content as input to train the Siamese RAE. The encoder last hidden state vector of Siamese RAE is used as the feature for QbyE-STD, which is a fixed dimensional embedding feature containing mostly semantic content related information. The advantages of the proposed approach are: 1) extracting more compact feature with fixed dimension while keeping the semantic information for STD; 2) the extracted feature can describe the sequential phonetic structure of similar sounds to degree, which can be applied for zero-resource QbyESTD. Evaluations on real scene Chinese speech interaction data and TIMIT confirm the effectiveness and efficiency of the proposed approach compared to the conventional ones.

[1]  Karen Livescu,et al.  Deep convolutional acoustic word embeddings using word-pair side information , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[3]  Murat Saraclar,et al.  Similarity Learning Based Query Modeling for Keyword Search , 2017, INTERSPEECH.

[4]  Bin Ma,et al.  Pairwise learning using multi-lingual bottleneck features for low-resource query-by-example spoken term detection , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[6]  Jonas Mueller,et al.  Siamese Recurrent Architectures for Learning Sentence Similarity , 2016, AAAI.

[7]  Brian Kingsbury,et al.  End-to-end ASR-free keyword search from speech , 2017, ICASSP.

[8]  Ji Wu,et al.  A Rescoring Approach for Keyword Search Using Lattice Context Information , 2017, INTERSPEECH.

[9]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[10]  Bin Ma,et al.  Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection , 2016, INTERSPEECH.

[11]  Lin-Shan Lee,et al.  Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder , 2016, INTERSPEECH.

[12]  Tara N. Sainath,et al.  Query-by-example keyword spotting using long short-term memory networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).