Discovery and organization of multi-camera user-generated videos of the same event

We propose a framework for automatically grouping and aligning unedited multi-camera User-Generated Videos (UGVs) within a database. The framework analyzes the audio tracks to match and cluster UGVs that capture the same spatio-temporal event, and estimates their relative time-shifts so that they can be temporally aligned. We design a descriptor derived from the pairwise matching of the audio chroma features of UGVs. This descriptor makes it possible to define a classification threshold for automatic query-by-example event identification. We evaluate the proposed identification and synchronization framework on a database of 263 multi-camera recordings of 48 real-world events and compare it with state-of-the-art methods. Experimental results show that the proposed approach remains effective in the presence of various audio degradations.
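
As a rough illustration of matching two recordings by their chroma features and estimating a relative time-shift, the sketch below compares two precomputed chroma sequences at every candidate lag and keeps the lag with the highest mean cosine similarity. It is a minimal sketch under stated assumptions, not the descriptor proposed in the paper: the function names, the `max_lag` and `min_overlap` parameters, and the peak-to-mean ratio used as a match score are all hypothetical, and chroma extraction is assumed to have been done elsewhere.

```python
import math


def normalize(frame):
    """Scale one 12-bin chroma frame to unit Euclidean norm (all-zero frames stay zero)."""
    norm = math.sqrt(sum(v * v for v in frame))
    return [v / norm for v in frame] if norm > 0 else list(frame)


def lag_scores(chroma_a, chroma_b, max_lag, min_overlap=10):
    """Return {lag: mean cosine similarity between a[i + lag] and b[i]}.

    Lags with fewer than `min_overlap` overlapping frames are skipped
    (an assumption of this sketch, to avoid unreliable averages).
    """
    a = [normalize(f) for f in chroma_a]
    b = [normalize(f) for f in chroma_b]
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        sims = [
            sum(x * y for x, y in zip(a[i + lag], b[i]))
            for i in range(len(b))
            if 0 <= i + lag < len(a)
        ]
        if len(sims) >= min_overlap:
            scores[lag] = sum(sims) / len(sims)
    return scores


def estimate_shift(chroma_a, chroma_b, max_lag, min_overlap=10):
    """Return (best_lag, peak_ratio) for two chroma sequences.

    best_lag is in frames and means b[i] best matches a[i + best_lag].
    peak_ratio is the best score divided by the mean score of all other
    lags; it is a hypothetical stand-in for a match descriptor that could
    be compared against a threshold.
    """
    scores = lag_scores(chroma_a, chroma_b, max_lag, min_overlap)
    if not scores:
        raise ValueError("no lag has enough overlapping frames")
    best_lag = max(scores, key=scores.get)
    others = [s for lag, s in scores.items() if lag != best_lag]
    mean_other = sum(others) / len(others) if others else 0.0
    peak_ratio = scores[best_lag] / mean_other if mean_other > 0 else float("inf")
    return best_lag, peak_ratio
```

Two recordings of the same event would be expected to give a pronounced peak at their true offset, whereas unrelated recordings would give a flatter score profile; a threshold on the ratio would then separate the two cases. Multiplying the lag by the chroma hop duration converts it to seconds.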
