Sound Search by Text Description or Vocal Imitation?

Searching for sounds by text labels is often difficult, as text descriptions cannot capture the audio content in detail. Query by vocal imitation bridges this gap and provides a novel way to search for sounds. Several algorithms for sound search by vocal imitation have been proposed and evaluated in simulation; however, they have not been deployed in a real search engine or evaluated by real users. This pilot work conducts a subjective study comparing these two approaches to sound search and tries to answer the question of which approach works better for which kinds of sounds. To do so, we developed two web-based sound search engines, one based on vocal imitation (Vroom!) and the other on text description (TextSearch). We also developed an experimental framework to host these engines and collect statistics on user behavior and ratings. Results showed that Vroom! received significantly higher search satisfaction ratings than TextSearch for sound categories that were difficult for subjects to describe in text. Results also showed a better overall ease-of-use rating for Vroom! than for TextSearch on the limited sound library used in our experiments. These findings suggest advantages of vocal-imitation-based sound search in practice.
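As an illustration only (the abstract does not specify the statistical test used), a minimal sketch of how paired per-subject satisfaction ratings for the two engines could be compared is shown below. The rating values, variable names, and the choice of a Wilcoxon signed-rank test are assumptions for demonstration, not the authors' actual analysis.

```python
# Minimal sketch (not the paper's actual analysis): compare paired
# per-subject satisfaction ratings for Vroom! vs. TextSearch with a
# Wilcoxon signed-rank test. All rating values are illustrative placeholders.
from scipy.stats import wilcoxon

# Hypothetical 5-point satisfaction ratings from the same subjects,
# one value per subject, for a single hard-to-describe sound category.
vroom_ratings      = [4, 5, 4, 3, 5, 4, 4, 5]
textsearch_ratings = [2, 3, 3, 2, 4, 3, 2, 3]

# Paired, non-parametric test: suitable for ordinal Likert-style data.
stat, p_value = wilcoxon(vroom_ratings, textsearch_ratings)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Ratings differ significantly between the two engines.")
```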
