Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology

Query-By-Vocal Imitation (QBV) search systems enable searching a collection of audio files using a vocal imitation as a query. This is useful when sounds do not have commonly agreed-upon text labels, or when many sounds share a label. As deep learning approaches have been successfully applied to QBV systems, datasets for building models have become increasingly important. We present Vocal Imitation Set, a new vocal imitation dataset containing 11,242 crowd-sourced vocal imitations of 302 sound event classes in the AudioSet sound event ontology. It is the largest publicly available dataset of vocal imitations, as well as the first vocal imitation dataset to adopt the widely used AudioSet ontology. Each imitation recording in Vocal Imitation Set was rated by a human listener on how similar the imitation is to the original recording it imitates. Vocal Imitation Set also contains an average of 10 different original recordings per sound class. Since each sound class has about 19 listener-vetted imitations and 10 original sound files, the dataset is well suited for training models to perform fine-grained vocal imitation-based search within sound classes. We provide an example of using the dataset to measure how well the existing state of the art in QBV search performs on fine-grained search.
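The fine-grained search task described above amounts to ranking the ~10 original recordings within a sound class by their similarity to an imitation query and checking where the intended recording lands. A minimal sketch of one common retrieval metric for this setup, mean reciprocal rank (MRR), is shown below; the similarity scores, recording identifiers, and function names are illustrative assumptions, not part of the dataset's released tooling.

```python
# Hypothetical sketch of fine-grained QBV evaluation: given similarity
# scores between an imitation query and the original recordings in its
# sound class, compute the mean reciprocal rank (MRR) of the intended
# recording. All identifiers and scores below are illustrative.

def reciprocal_rank(scores, target):
    """Rank recordings by descending similarity; return 1/rank of target."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1.0 / (ranked.index(target) + 1)

def mean_reciprocal_rank(queries):
    """queries: list of (scores_by_recording, target_recording_id) pairs."""
    return sum(reciprocal_rank(s, t) for s, t in queries) / len(queries)

# Toy example: two imitation queries, each scored against three
# original recordings from the query's sound class.
queries = [
    ({"rec_a": 0.9, "rec_b": 0.4, "rec_c": 0.1}, "rec_a"),  # rank 1
    ({"rec_a": 0.2, "rec_b": 0.7, "rec_c": 0.5}, "rec_c"),  # rank 2
]
print(mean_reciprocal_rank(queries))  # (1/1 + 1/2) / 2 = 0.75
```

In practice the similarity scores would come from a trained QBV model (e.g. a Siamese-style network comparing imitation and recording embeddings), and MRR would be averaged over all listener-vetted imitations in each class.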
