Pruning subsequence search with attention-based embedding

Searching a large database to find a sequence that is most similar to a query can be prohibitively expensive, particularly if individual sequence comparisons involve complex operations such as warping. To achieve scalability, "pruning" heuristics are typically employed to minimize the portion of the database that must be searched with more complex matching. We present an approximate pruning technique which involves embedding sequences in a Euclidean space. Sequences are embedded using a convolutional network with a form of attention that integrates over time, trained on matching and non-matching pairs of sequences. By using fixed-length embeddings, our pruning method effectively runs in constant time, making it many orders of magnitude faster than full dynamic time warping-based matching for large datasets. We demonstrate our approach on a large-scale musical score-to-audio recording retrieval task.

[1]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Daniel P. W. Ellis,et al.  Large-Scale Content-Based Matching of MIDI and Audio Files , 2015, ISMIR.

[3]  Simon Dixon,et al.  Sequential Complexity as a Descriptor for Musical Similarity , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[4]  Colin Raffel,et al.  librosa: Audio and Music Signal Analysis in Python , 2015, SciPy.

[5]  Colin Raffel,et al.  librosa: v0.4.0 , 2015 .

[6]  George Tzanetakis,et al.  Musical genre classification of audio signals , 2002, IEEE Trans. Speech Audio Process..

[7]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[8]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[9]  Andreas Rauber,et al.  Facilitating Comprehensive Benchmarking Experiments on the Million Song Dataset , 2012, ISMIR.

[10]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[11]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[12]  Stephen Cranefield,et al.  A Study on Feature Analysis for Musical Instrument Classification , 2008, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[13]  Eamonn J. Keogh,et al.  Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping , 2012, KDD.

[14]  Daniel P. W. Ellis,et al.  Song-Level Features and Support Vector Machines for Music Classification , 2005, ISMIR.

[15]  Colin Raffel,et al.  Lasagne: First release. , 2015 .

[16]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[17]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[18]  Judith C. Brown Calculation of a constant Q spectral transform , 1991 .

[19]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[21]  Yoshua Bengio,et al.  Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks , 2015, IEEE Transactions on Multimedia.

[22]  Thierry Bertin-Mahieux,et al.  The Million Song Dataset , 2011, ISMIR.

[23]  Dimitrios Gunopulos,et al.  Embedding-based subsequence matching in time-series databases , 2011, TODS.

[24]  Volume Assp,et al.  ACOUSTICS. SPEECH. AND SIGNAL PROCESSING , 1983 .

[25]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[26]  Jasper Snoek,et al.  Practical Bayesian Optimization of Machine Learning Algorithms , 2012, NIPS.

[27]  Razvan Pascanu,et al.  Theano: new features and speed improvements , 2012, ArXiv.