Paper information - Suggesting Sounds for Images from Video Collections

Given a still image, humans can easily think of a sound associated with it. For instance, people might associate a picture of a car with the sound of a car engine. In this paper, we aim to retrieve sounds corresponding to a query image. To solve this challenging task, our approach exploits the correlation between the audio and visual modalities in video collections. A major difficulty is the large amount of uncorrelated audio in the videos, i.e., audio that does not correspond to the main image content, such as voice-over, background music, added sound effects, or sounds originating off-screen. We present an unsupervised, clustering-based solution that automatically separates correlated sounds from uncorrelated ones. The core algorithm operates in a joint audio-visual feature space, in which we perform iterated mutual kNN clustering to filter out uncorrelated sounds. We also introduce a new dataset of correlated audio-visual data, on which we evaluate our approach and compare it to alternative solutions. Experiments show that our approach can successfully deal with a large amount of uncorrelated audio.
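
The abstract does not specify the features, the distance measure, the value of k, or the stopping rule, so the following is only a minimal sketch of what iterated mutual kNN filtering could look like, not the authors' implementation. It assumes each video clip is already represented by one joint feature vector (for example, a visual descriptor concatenated with an audio descriptor) and that Euclidean distance is meaningful in that space. The names `k_nearest`, `iterated_mutual_knn_filter`, `min_mutual`, and `max_iters` are hypothetical.

```python
import math


def k_nearest(points, active, k):
    """For each active index, return the set of its k nearest active neighbours."""
    neighbours = {}
    for i in active:
        others = sorted(
            (j for j in active if j != i),
            key=lambda j: math.dist(points[i], points[j]),  # Euclidean (assumption)
        )
        neighbours[i] = set(others[:k])
    return neighbours


def iterated_mutual_knn_filter(points, k=2, min_mutual=1, max_iters=10):
    """Repeatedly drop points that have fewer than `min_mutual` mutual kNN links.

    Two points are mutual neighbours when each is among the other's k nearest.
    Returns the indices of the points that survive the filtering.
    """
    active = list(range(len(points)))
    for _ in range(max_iters):
        nn = k_nearest(points, active, k)
        kept = [
            i for i in active
            if sum(1 for j in nn[i] if i in nn[j]) >= min_mutual
        ]
        if len(kept) == len(active):  # nothing removed: converged
            break
        active = kept
    return active


# Toy joint features: two tight groups plus one isolated clip (index 6),
# standing in for a clip whose audio does not match its visual content.
clips = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (5.0, 30.0),
]
survivors = iterated_mutual_knn_filter(clips, k=2)
print(survivors)
```

In this toy setting the isolated clip is never among the k nearest neighbours of the clips it selects, so it has no mutual links and is removed, while the two dense groups are left intact. Rebuilding the graph after each removal is one plausible reading of "iterated"; the paper itself may iterate differently.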
