Fusion of Compound Queries with Multiple Modalities for Known Item Video Search

Multimedia collections are ubiquitous and contain hundreds of hours of video. Retrieving a particular scene of a video from a large collection, a task known as Known Item Search (KIS), is difficult because every video shot is multimodal and the query, whether visual or textual, can be complex. We tackle these challenges in two steps. First, we fuse multiple modalities in a nonlinear graph-based way for each subtopic of the query. Then, we fuse the top retrieved video shots per sub-query to produce the final list of retrieved shots, which is re-ranked using temporal information. The framework is evaluated on popular KIS tasks in the context of video shot retrieval and achieves the highest Mean Reciprocal Rank scores.
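To make the evaluation measure and the sub-query fusion step concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the method of this paper: reciprocal rank fusion is used as a simple stand-in for the fusion of per-sub-query result lists, the nonlinear graph-based modality fusion and the temporal re-ranking are not modelled, and the function names, the constant k=60, and the toy shot IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of shot IDs into one list.

    Stand-in fusion rule (assumption): each list contributes 1 / (k + rank)
    to the score of every shot it contains.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, shot_id in enumerate(ranking, start=1):
            scores[shot_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by shot ID for determinism.
    return sorted(scores, key=lambda s: (-scores[s], s))


def mean_reciprocal_rank(results, targets):
    """Average of 1 / rank of the known item over all queries.

    A query whose known item is not retrieved contributes 0.
    """
    total = 0.0
    for query_id, ranking in results.items():
        target = targets[query_id]
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(results)


# Hypothetical toy data: two queries, each split into two sub-queries,
# with one ranked list of shot IDs per sub-query.
sub_query_rankings = {
    "q1": [["s3", "s1", "s2"], ["s1", "s4", "s3"]],
    "q2": [["s7", "s5"], ["s5", "s6"]],
}
# The single correct shot (the "known item") for each query.
known_items = {"q1": "s1", "q2": "s7"}

fused = {q: reciprocal_rank_fusion(lists) for q, lists in sub_query_rankings.items()}
mrr = mean_reciprocal_rank(fused, known_items)
print(fused)
print(mrr)
```

In this toy run the known item is ranked first for q1 and second for q2, so the Mean Reciprocal Rank is (1 + 1/2) / 2 = 0.75. A higher value means the known item tends to appear closer to the top of the final list.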
