Retrieving sounds by vocal imitation recognition

Vocal imitation is widely used in human communication. In this paper, we propose an approach that automatically recognizes the sound concept underlying a vocal imitation and then retrieves sounds of that concept. Because different acoustic aspects (e.g., pitch, loudness, timbre) are emphasized when imitating different sounds, a key challenge in vocal imitation recognition is extracting appropriate features; hand-crafted features may not work well across a large variety of imitations. Instead, we use a stacked auto-encoder to automatically learn features from a set of vocal imitations in an unsupervised way. A multi-class SVM is then trained on the sound concepts of interest using their training imitations. Given a new vocal imitation of one of these concepts, our system recognizes the underlying concept and returns it with a high rank among all concepts. Experiments show that our system significantly outperforms an MFCC-based comparison system in both classification and retrieval.
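The pipeline described above (unsupervised feature learning with a stacked auto-encoder, followed by a multi-class SVM that ranks sound concepts for a query imitation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the sigmoid activations, the frame-averaging pooling, the RBF kernel, and the use of scikit-learn's SVC in place of LIBSVM are all assumptions made for the sake of a self-contained example, and random arrays stand in for real spectrogram frames of imitations.

```python
# Sketch: stacked auto-encoder features + multi-class SVM for
# vocal-imitation recognition and concept retrieval (illustrative only).
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

def pretrain_layer(x, hidden_dim, epochs=50, lr=1e-3):
    """Greedy layer-wise pretraining of one auto-encoder layer.
    Returns the trained encoder and the encoded representation of x."""
    in_dim = x.shape[1]
    encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
    decoder = nn.Linear(hidden_dim, in_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    loss_fn = nn.MSELoss()
    x_t = torch.as_tensor(x, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        recon = decoder(encoder(x_t))
        loss = loss_fn(recon, x_t)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return encoder, encoder(x_t).numpy()

def learn_features(frames, hidden_dims=(500, 100)):
    """Stack auto-encoder layers trained in an unsupervised way on
    spectrogram frames of vocal imitations (one row per frame)."""
    encoders, h = [], frames
    for dim in hidden_dims:
        enc, h = pretrain_layer(h, dim)
        encoders.append(enc)
    return encoders

def encode(encoders, frames):
    """Map frames through the stacked encoder and average over time to get
    one feature vector per imitation (a simplifying assumption here)."""
    h = torch.as_tensor(frames, dtype=torch.float32)
    with torch.no_grad():
        for enc in encoders:
            h = enc(h)
    return h.numpy().mean(axis=0)

# --- toy usage with random data standing in for spectrogram frames ---
rng = np.random.default_rng(0)
n_concepts, imitations_per_concept, n_frames, n_bins = 5, 8, 40, 128
all_frames = rng.random((n_concepts * imitations_per_concept * n_frames, n_bins))
encoders = learn_features(all_frames)           # unsupervised feature learning

X, y = [], []
for concept in range(n_concepts):
    for _ in range(imitations_per_concept):
        frames = rng.random((n_frames, n_bins))  # one training imitation
        X.append(encode(encoders, frames))
        y.append(concept)
svm = SVC(kernel="rbf", probability=True).fit(np.array(X), y)

# Retrieval: rank all sound concepts for a new imitation by classifier score.
query = encode(encoders, rng.random((n_frames, n_bins)))
scores = svm.predict_proba(query.reshape(1, -1))[0]
ranking = np.argsort(scores)[::-1]               # concepts ordered by relevance
print("Top-ranked concept:", svm.classes_[ranking[0]])
```

Ranking concepts by the classifier's per-class scores, rather than taking only the top prediction, is what turns the recognizer into a retrieval system: a correct concept that is not ranked first can still be returned near the top of the list.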
