Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation

Conventional methods for finding audio in databases typically search text labels rather than the audio itself. This can be problematic, as labels may be missing, irrelevant to the audio content, or unknown to users. Query by vocal imitation instead lets users search by vocally imitating the sound they seek. To make this possible, appropriate audio feature representations and effective similarity measures between imitations and original sounds must be developed. In this paper, we build upon our preliminary work to propose Siamese-style convolutional neural networks that learn feature representations and similarity measures in a unified end-to-end training framework. Our Siamese architecture uses two convolutional neural networks to extract features, one from vocal imitations and the other from original sounds. The encoded features are then concatenated and fed into a fully connected network to estimate their similarity. We propose two versions of the system: IMINET is symmetric, with two encoders of identical structure trained from scratch, while TL-IMINET is asymmetric and adopts transfer learning by pretraining the two encoders on related tasks: spoken language recognition for the imitation encoder and environmental sound classification for the original sound encoder. Experimental results show that both versions of the proposed system outperform a state-of-the-art system for sound search by vocal imitation, and that performance improves further when they are fused with the state-of-the-art system. Results also show that transfer learning significantly improves retrieval performance. This paper also provides insights into the proposed networks by visualizing and sonifying input patterns that maximize the activation of certain neurons in different layers.
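To make the described architecture concrete, the following is a minimal PyTorch sketch of a Siamese-style network with two convolutional encoders whose outputs are concatenated and scored by a fully connected head. All layer sizes, kernel counts, input dimensions, and class names here are illustrative assumptions, not the paper's actual hyperparameters; in the symmetric IMINET variant the two encoder towers would share a structure (and be trained from scratch), while in TL-IMINET they would differ and be initialized from pretrained models.

```python
# A minimal, illustrative sketch of a Siamese-style similarity network.
# Layer sizes and input shapes are assumptions, not the paper's settings.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Convolutional tower mapping a (1, freq, time) spectrogram to a feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size output regardless of input length
        )
        self.fc = nn.Linear(32 * 4 * 4, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class SiameseStyleNet(nn.Module):
    """Two encoders (identical for IMINET; asymmetric/pretrained for TL-IMINET)
    whose concatenated features are scored by a fully connected head."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.imitation_encoder = Encoder(feat_dim)
        self.recording_encoder = Encoder(feat_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # similarity score in [0, 1]
        )

    def forward(self, imitation, recording):
        z = torch.cat([self.imitation_encoder(imitation),
                       self.recording_encoder(recording)], dim=1)
        return self.head(z)

# Usage: score a candidate recording against a vocal-imitation query,
# then rank all candidates in the database by this score.
model = SiameseStyleNet()
query = torch.randn(1, 1, 64, 128)      # e.g., log-spectrogram of the imitation
candidate = torch.randn(1, 1, 64, 128)  # log-spectrogram of an original sound
score = model(query, candidate)         # higher score = better match
```

At retrieval time, such a model would be applied pairwise between the query imitation and every candidate sound, with results ranked by the predicted similarity.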
