Audio-visual fingerprinting and cross-modal aggregation: Components and applications

In recent years, the amount of digital media has grown rapidly thanks to efficient media encoding algorithms. As a result, large numbers of audio and video files are stored on users' hard disks and on popular video community platforms. Because suitable metadata standards are lacking or are not followed, the description of this data is often missing or misleading. Audio and visual identification algorithms have therefore been developed that identify videos or pieces of music and provide a suitable metadata description or copyright information from a content database. Combining both sources of information, the visual and the audio part of a video, for simultaneous identification is called cross-modal processing. In this paper, the principal structure of an audio and of a visual identification system is outlined, and different state-of-the-art algorithms are discussed. Furthermore, a cross-modal system is presented, with particular attention to cross-modal aggregation. Finally, current use cases for audio, visual, and cross-modal search and retrieval are described.
