Similarity Analysis of Visual Sketch-based Search for Sounds

Searching through a large audio database for a specific sound can be a slow and tedious task with detrimental effects on creative workflow. Listening to each sample is time consuming, while textual descriptions or tags may be insufficient, unavailable, or simply unable to meaningfully capture certain sonic qualities. This paper explores the use of visual sketches that express the mental model associated with a sound to accelerate the search process. To achieve this, a study was conducted to collect data on how people visually represent sound: 30 participants provided hand-sketched visual representations for a range of 30 different sounds. After augmenting the data to a sparse set of 855 samples, two different autoencoders were trained. The first finds similar sketches in latent space and delivers the associated audio files. The second is a multimodal autoencoder that combines visual and sonic cues in a common feature space, but it is limited by the absence of audio input during the search task. Both models were then used to implement and discuss a visual query-by-sketch search interface for sounds.
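
The retrieval step of the first autoencoder (find similar sketches in latent space, then return the associated audio files) can be illustrated with a minimal sketch. It assumes that an encoder has already mapped each stored sketch and the query sketch to a latent vector. The cosine similarity measure, the function names `cosine_similarity` and `rank_sounds`, the toy latent vectors, and the file names are all illustrative assumptions and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length latent vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank_sounds(query_latent, sketch_index, top_k=3):
    """Return up to top_k (sound_file, score) pairs, best match first.

    sketch_index is a list of (latent_vector, sound_file) pairs. Several
    sketches may point to the same sound, so only the best-scoring sketch
    per sound file is kept.
    """
    best = {}
    for latent, sound_file in sketch_index:
        score = cosine_similarity(query_latent, latent)
        if sound_file not in best or score > best[sound_file]:
            best[sound_file] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy index: latent codes of stored sketches and their sounds.
sketch_index = [
    ([0.9, 0.1, 0.0], "kick.wav"),
    ([0.8, 0.2, 0.1], "kick.wav"),
    ([0.0, 1.0, 0.2], "hihat.wav"),
    ([0.1, 0.1, 0.9], "pad.wav"),
]

# Hypothetical latent code of the query sketch drawn by the user.
query = [1.0, 0.0, 0.0]
results = rank_sounds(query, sketch_index, top_k=2)
for sound_file, score in results:
    print(f"{sound_file}: {score:.3f}")
```

Keeping only the best-scoring sketch per sound avoids returning the same audio file several times when multiple participants sketched the same sound; other aggregation rules, such as averaging the scores, would be equally plausible.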
