Suggesting Sounds for Images from Video Collections

Given a still image, humans can easily think of a sound associated with it. For instance, people might associate the picture of a car with the sound of a car engine. In this paper we aim to retrieve sounds corresponding to a query image. To solve this challenging task, our approach exploits the correlation between the audio and visual modalities in video collections. A major difficulty is the large amount of uncorrelated audio in the videos, i.e., audio that does not correspond to the main image content, such as voice-over, background music, added sound effects, or sounds originating off-screen. We present an unsupervised, clustering-based solution that automatically separates correlated sounds from uncorrelated ones. The core algorithm operates in a joint audio-visual feature space, in which we perform iterated mutual kNN clustering to filter out uncorrelated sounds. We also introduce a new dataset of correlated audio-visual data, on which we evaluate our approach and compare it to alternative solutions. Experiments show that our approach can successfully deal with a large amount of uncorrelated audio.
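
The idea of filtering with a mutual kNN graph can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `mutual_knn_filter`, the Euclidean distance, the rule of dropping samples that have no mutual neighbour, and the stopping criterion are all assumptions made for illustration. Each row of `features` stands in for one joint audio-visual descriptor.

```python
import numpy as np


def mutual_knn_filter(features, k=3, max_iters=10):
    """Iteratively drop samples that have no mutual k-nearest neighbour.

    Hypothetical sketch: returns the indices of the surviving samples.
    """
    features = np.asarray(features, dtype=float)
    keep = np.arange(len(features))
    for _ in range(max_iters):
        if len(keep) <= k:
            break
        pts = features[keep]
        # Pairwise Euclidean distances; a sample is never its own neighbour.
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        # knn[i, j] is True when j is among the k nearest neighbours of i.
        nearest = np.argsort(dists, axis=1)[:, :k]
        knn = np.zeros(dists.shape, dtype=bool)
        knn[np.repeat(np.arange(len(keep)), k), nearest.ravel()] = True
        # A mutual edge exists only if i lists j and j lists i.
        mutual = knn & knn.T
        has_partner = mutual.any(axis=1)
        if has_partner.all():
            break  # nothing left to remove (assumed stopping rule)
        keep = keep[has_partner]
    return keep


# Toy data: two tight groups of four samples and one isolated sample (index 8).
samples = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1],
    [5.0, 5.0],
])
kept = mutual_knn_filter(samples, k=2)
```

In this toy example the isolated sample names two members of the first group as its nearest neighbours, but none of them names it back, so it has no mutual edge and is removed. Samples inside a dense group keep each other, which is the intuition behind treating mutually connected audio-visual samples as correlated.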
