Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach, which captures temporal patterns of audio similarity between video pairs. To compute a robust similarity between two videos, we first extract representative audio-based video descriptors through transfer learning from a Convolutional Neural Network (CNN) trained on a large-scale dataset of audio events, and then compute a similarity matrix from the pairwise similarities of these descriptors. The similarity matrix is subsequently fed to a CNN that captures the temporal structures within it. We train the network with a triplet generation process, optimizing the triplet loss function. To evaluate the proposed approach, we manually annotated two publicly available video datasets according to the audio duplicity between their videos. The proposed approach achieves competitive results compared with three state-of-the-art methods. Unlike the competing methods, it is also robust when retrieving audio duplicates generated with speed transformations.
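As a rough illustration of two steps named in the abstract, the sketch below builds a similarity matrix from pairwise descriptor similarities and evaluates a triplet loss on video-level scores. It is a minimal sketch under stated assumptions, not the AuSiL implementation: cosine similarity, the margin value, and the function names are all assumptions, and `video_score` is a hypothetical stand-in (mean of row-wise maxima) for the learned CNN that the abstract says processes the matrix.

```python
import numpy as np


def similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two descriptor sequences.

    a has shape (n, d) and b has shape (m, d); the result has shape (n, m).
    Cosine similarity is an assumption; the abstract only says "pairwise similarity".
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def video_score(sim: np.ndarray) -> float:
    """Hypothetical placeholder for the learned CNN: mean of row-wise maxima."""
    return float(sim.max(axis=1).mean())


def triplet_loss(score_pos: float, score_neg: float, margin: float = 0.5) -> float:
    """Hinge-style triplet loss on similarity scores (margin value is assumed)."""
    return max(0.0, score_neg - score_pos + margin)


# Toy descriptors: an anchor, a noisy copy of it (positive), an unrelated clip (negative).
rng = np.random.default_rng(0)
anchor = rng.normal(size=(6, 16))
positive = anchor + 0.01 * rng.normal(size=(6, 16))
negative = rng.normal(size=(5, 16))

pos_score = video_score(similarity_matrix(anchor, positive))
neg_score = video_score(similarity_matrix(anchor, negative))
loss = triplet_loss(pos_score, neg_score)
```

In a trained system, the scores would come from the network's output rather than from a fixed pooling rule, and the loss would be minimized over many generated triplets.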
