Query-by-Example Search with Multi-view Recurrent Auto-Encoder Representation

Query-by-example (QbE) speech search is the task of matching a spoken query against the speech recordings in a search collection. As voice-based human-computer interaction develops rapidly, the demands placed on spoken-query search are growing. In low-resource or zero-resource settings, QbE speech search usually relies on dynamic time warping (DTW) to compare spoken queries with speech segments in the search collection. Recent studies have found that methods based on acoustic word embeddings can improve both search performance and search speed. In this paper, we combine autoencoders with the multi-view approach and propose a weakly supervised method for training a multi-view recurrent autoencoder (RAE) model. The model represents spoken queries and speech segments in the search collection as fixed-dimensional vectors, so that matching speech segments can be found by nearest neighbor search.
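To make the contrast between the two search strategies concrete, the following is a minimal sketch rather than the model proposed here. It assumes each query and segment is a NumPy array of frame-level features with shape (frames, dims). The function names `dtw_distance`, `embed`, and `nearest_segments` are hypothetical, and `embed` is only a mean-pooling placeholder standing in for a trained encoder such as the multi-view RAE.

```python
import numpy as np


def dtw_distance(x, y):
    """Length-normalized DTW cost between sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    # Euclidean distance between every frame of x and every frame of y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # acc[i, j] holds the cheapest alignment cost of x[:i] with y[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)


def embed(segment):
    """Placeholder encoder (assumption): mean-pool frames, then L2-normalize.

    A trained model would replace this function; it only illustrates mapping
    a variable-length segment to a fixed-dimensional vector.
    """
    v = segment.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)


def nearest_segments(query, segments, k=1):
    """Return indices of the k segments whose embeddings are closest to the query."""
    q = embed(query)
    index = np.stack([embed(s) for s in segments])  # shape (num_segments, d)
    scores = index @ q  # cosine similarity, since all vectors are unit-length
    return np.argsort(-scores)[:k]
```

With DTW, every query-segment comparison costs time proportional to the product of the two sequence lengths. With embeddings, the collection is encoded once and each query reduces to a single matrix-vector product, which is why embedding-based search can be faster at query time.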
