Technische Universität Chemnitz at TRECVID Instance Search 2015

Correspondence to: Marc Ritter (marc.ritter@informatik.tu-chemnitz.de)

This contribution presents our second appearance at the TRECVID Instance Search task (Over et al., 2015; Smeaton et al., 2006). We participated in the evaluation campaign with four runs (one interactive and three automatic) using audiovisual concepts. A combination of different methods is used in every run. Our basic approach builds on probabilistic assumptions about the occurrences of instances. A deep convolutional neural network (CNN) is used in connection with the classification of filming locations and the analysis of audio tracks. The extraction of SIFT features is combined with K-Nearest Neighbors (KNN) clustering and matching to search for instances. In addition, we applied a sequence clustering method that incorporates visual similarity calculations between all corresponding shots in the provided omnibus episodes. Throughout all approaches, we make use of our adaptable and easy-to-use keyframe extraction scheme from the previous evaluation period (Ritter et al., 2014).

1 Structured Abstract

1. Briefly, list all the different sources of training data used in the creation of your system and its components.

• For training, we solely used the given master shot reference and the audio and video tracks of the first video (ID 0) from the provided BBC EastEnders video footage.

2. Briefly, what approach or combination of approaches did you test in each of your submitted runs?

• Within the first interactive run I E TUC 1, we use CNN & visual Bag-of-Words as well as SIFT & KNN based approaches in combination with audio-based indoor/outdoor detection and a probabilistic shot composition (PRNA) that is based on around 1.1 million extracted keyframes and thus shrinks the keyframe pool with respect to this year's queries to around 18,000 available frames.

• Our first automatic run F E TUC 2 combines CNN & visual Bag-of-Words approaches with audio analysis of the three different classes indoor, outdoor, and crowd & machine.

• The automatic run F A TUC 3 combines SIFT features with K-Nearest Neighbors (KNN) matching and serves as a baseline.

• Our last automatic run F A TUC 4 combines our approach to partially semantic sequence clustering (SC) as input to the Probabilistic Run-length weighted Neighborhood Algorithm (PRNA) from the previous year, which is built on probabilistic assumptions about the occurrences of instances.

3. What if any significant differences (in terms of what measures) did you find among the runs?

• We present an adaptable and easy-to-use keyframe extraction scheme that reduces the large amount of 42 million frames to 1.1 million keyframes, which were used for indexing and instance comparison in I E TUC MI 1.

• As expected, and in terms of MAP, there is a significant difference of more than 13% between the interactive and the best fully automatic run.

• The results of the run F A TUC 4 with SC & PRNA are promising in terms of Precision at rank 30 (P30). Since the sequence clustering did not finish, some optimization potential remains to increase the resulting scores.

4. Based on the results, can you estimate the relative contribution of each component of your system/approach to its effectiveness?

• The reduction scheme of extracting representative keyframes via preprocessing or even SC & PRNA is crucial for efficient further processing.

• I E TUC MI 1 and F A TUC 3 showed reasonable results for topics containing sharp edges using SIFT features.

• The usability of our interactive GUI was significantly improved, allowing reviewers to inspect approximately 3,500 instance candidates on average per topic within the evaluation time frame and thus to quickly reject a large number of false positives.

5. Overall, what did you learn about runs/approaches and the research question(s) that motivated them?

• The SC & PRNA method seems to be a usable heuristic for finding a set of new shots containing an instance based on some detected samples in the direct or indirect neighborhood, especially for boosting the top 5 result entries at a precision of almost 40%.

• SIFT features deliver promising results for topics with specific properties (a minimal sketch of the SIFT & KNN matching step is given after this list).

• An appropriate ranking algorithm needs to be developed in order to create stable results in the first 1,000 appearances above P30. Additional preliminary tests with similarity measures like PSNR, the structural similarity index, and histogram correlation indicated insufficient ranking capabilities when applied to 75 million image patches of size 48×48 and were therefore discontinued (a sketch of these measures also follows this list). Incorporating machine learning methods might solve these aspects.
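The following is a minimal sketch of the kind of SIFT extraction and KNN matching referred to above, written with OpenCV in Python. It is an illustration only: the FLANN index parameters and the ratio-test threshold of 0.75 are assumptions, not the values used in the submitted runs.

    # Sketch: SIFT keypoints + approximate KNN (FLANN) matching between a
    # query patch and a keyframe. Parameters are illustrative assumptions.
    import cv2

    def count_sift_matches(query_path, keyframe_path, ratio=0.75):
        query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
        frame = cv2.imread(keyframe_path, cv2.IMREAD_GRAYSCALE)

        # cv2.SIFT_create() exists in OpenCV >= 4.4; older builds expose
        # SIFT via cv2.xfeatures2d.SIFT_create().
        sift = cv2.SIFT_create()
        _, desc_q = sift.detectAndCompute(query, None)
        _, desc_f = sift.detectAndCompute(frame, None)
        if desc_q is None or desc_f is None:
            return 0

        # FLANN with a KD-tree index, k = 2 nearest neighbors per descriptor.
        matcher = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5},
                                        {"checks": 50})
        matches = matcher.knnMatch(desc_q, desc_f, k=2)

        # Lowe's ratio test keeps only distinctive correspondences.
        good = [p[0] for p in matches
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        return len(good)

    # A keyframe can then be ranked by its number of surviving matches:
    # score = count_sift_matches("query_patch.png", "keyframe.png")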
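For completeness, a minimal sketch of the patch-similarity measures mentioned in the last item follows (PSNR and histogram correlation; the structural similarity index would additionally require, e.g., scikit-image). The 48×48 patch size is taken from the text above; everything else is illustrative.

    # Sketch: PSNR and histogram correlation between two 48x48 grayscale
    # patches, the kind of measures whose ranking capability proved
    # insufficient in our preliminary tests.
    import cv2
    import numpy as np

    PATCH_SIZE = (48, 48)

    def psnr(a, b):
        mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

    def hist_correlation(a, b):
        ha = cv2.calcHist([a], [0], None, [256], [0, 256])
        hb = cv2.calcHist([b], [0], None, [256], [0, 256])
        cv2.normalize(ha, ha)
        cv2.normalize(hb, hb)
        return cv2.compareHist(ha, hb, cv2.HISTCMP_CORREL)

    # patch_a = cv2.resize(cv2.imread("a.png", cv2.IMREAD_GRAYSCALE), PATCH_SIZE)
    # patch_b = cv2.resize(cv2.imread("b.png", cv2.IMREAD_GRAYSCALE), PATCH_SIZE)
    # print(psnr(patch_a, patch_b), hist_correlation(patch_a, patch_b))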
The remainder of the paper is organized as follows: Section 2 provides a general view of the basic concepts and the common components of our system architecture as well as the underlying workflow for both run types. The specific algorithms used within the system are described in section 3. Remarks regarding the official evaluation results are given in section 4, followed by some conclusions in section 5.

2 System Architecture

The following section describes the overall system architecture and its components as well as the software and toolkits used to accomplish the instance search task. The preprocessing steps and the keyframe extraction process applied to the original video footage and the sample queries of the topics are discussed in section 2.1. Section 2.2 illustrates the tools used for feature extraction and for the classification of filming locations based on audio tracks. Our approach to deep learning is described in section 2.3. Further methods based on SIFT features and on the MPEG-7 feature extraction library are described in sections 2.4 and 2.5, respectively.

2.1 Preprocessing and Keyframe Extraction

Our different approaches to feature extraction demand extensive preprocessing of the given data. The underlying video collection from the BBC EastEnders series consists of 244 MPEG-4 omnibus video files, each containing four episodes of around 30 minutes plus short additional video sequences like advertisements. As the data collection for the Instance Search (INS) task remained unchanged, we mostly retained the sequence of preprocessing steps described in our report from the previous TRECVID evaluation campaign (Ritter et al., 2014). We used the already built collection of 471,526 shots according to the given master shot reference table. Due to the anamorphic format provided, we applied deinterlacing routines and a pixel aspect ratio correction to square pixels (resulting in a resolution of 1,024×576 pixels) by utilizing FFMPEG (http://www.ffmpeg.org, accessed 06/02/2015).

To further reduce the information that needs to be processed by our image processing chains, we decided to extract representative frames from each shot, referred to as keyframes, according to our adaptive keyframe extraction scheme from last year; see Figure 2 in (Ritter et al., 2014). By selecting up to five frames per shot, the method is capable of reducing the number of frames from 42 million to 1.15 million. Instead of extracting full-size images, we cropped each image at its full resolution by 48 pixels in horizontal and 32 pixels in vertical direction, resulting in a resolution of 928×512 pixels. This is expected to reduce or even prevent statistical corruption of the subsequent feature extraction processes by black borders or other artifacts at the margins of the pictures.

As the query images and the corresponding masks of the test set were also given in the anamorphic format, we stretched them to square pixels as well, so that both query and mask end up with the same aspect ratio as the index pictures in the corpus. When finished, we process the masks with a customized MATLAB function which delivers the coordinates and size of the bounding box that surrounds the marked white area denoting the searched object in the full-size query image. As a final step, the coordinates are mapped to the original picture to cut out object patches, resulting in query images that contain the searched object and a small part of the surrounding environment. A minimal sketch of these preprocessing steps is given below.
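The following sketch illustrates these preprocessing steps in Python. The concrete FFMPEG filter chain (yadif deinterlacing plus scale/setsar) is an assumption standing in for our actual invocation; the crop margins and resolutions are those stated above, and all file names are hypothetical.

    # Sketch: deinterlacing/aspect correction, border cropping, and mask-driven
    # extraction of query patches. Filter chain and file names are assumptions.
    import subprocess
    import cv2
    import numpy as np

    def deinterlace_and_rescale(src, dst):
        # Deinterlace and convert the anamorphic material to square pixels
        # at 1,024x576.
        subprocess.run(["ffmpeg", "-y", "-i", src,
                        "-vf", "yadif,scale=1024:576,setsar=1", dst],
                       check=True)

    def crop_borders(frame, dx=48, dy=32):
        # Remove 48 px left/right and 32 px top/bottom -> 928x512 keyframes.
        h, w = frame.shape[:2]
        return frame[dy:h - dy, dx:w - dx]

    def query_patch_from_mask(image, mask):
        # Bounding box of the white mask area marking the searched object,
        # mapped back onto the query image (mirrors the MATLAB routine).
        ys, xs = np.where(mask > 0)
        return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Usage (hypothetical file names):
    # deinterlace_and_rescale("episode.mp4", "episode_square.mp4")
    # patch = query_patch_from_mask(
    #     cv2.imread("query.png"),
    #     cv2.imread("query.mask.png", cv2.IMREAD_GRAYSCALE))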
For audio processing, we also used the same collection of audio-only versions of all shots, which were created at a sampling rate of 16 kHz (mono channel) in 16-bit PCM format.

2.2 openSMILE & Weka

The openSMILE feature extraction tool (Eyben et al., 2013) contains general audio signal processing functions which extract several speech- and music-related features. Both Low-Level Descriptors (LLDs) and statistical functionals can be calculated with this tool. The LLDs include energy, spectral, and cepstral (Mel Frequency Cepstral Coefficients, MFCC) features as well as the logarithmic harmonic-to-noise ratio (HNR), spectral harmonicity, and psychoacoustic spectral sharpness. The statistical functionals include, for example, means, extremes, and percentiles. We used the openSMILE tool to extract large feature sets from the audio tracks of the sample videos in order to classify the shots according to their filming locations.

The Weka toolkit (Hall et al., 2009) is a machine learning and data mining software suite which we used for the classification of filming locations based on audio features. To this end, a series of classifiers that have shown promising results in the literature was selected; a sketch of this pipeline is given below.
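Below is a minimal sketch of such an audio classification pipeline, driving the SMILExtract and Weka command-line tools from Python. The openSMILE configuration file, the SMO classifier, and all file names are illustrative assumptions rather than our exact setup.

    # Sketch: openSMILE extracts audio features into ARFF files, Weka trains
    # and evaluates a classifier on them. Config, classifier, and file names
    # are illustrative assumptions.
    import subprocess

    def extract_features(wav_path, arff_path, config="emobase.conf"):
        # SMILExtract: -C configuration, -I input wave file, -O output ARFF.
        subprocess.run(["SMILExtract", "-C", config,
                        "-I", wav_path, "-O", arff_path],
                       check=True)

    def train_and_evaluate(train_arff, test_arff, weka_jar="weka.jar"):
        # Train an SMO (support vector machine) classifier on the training
        # ARFF and evaluate it on the test ARFF; Weka reports accuracy and
        # a confusion matrix on standard output.
        subprocess.run(["java", "-cp", weka_jar,
                        "weka.classifiers.functions.SMO",
                        "-t", train_arff, "-T", test_arff],
                       check=True)

    # Usage (hypothetical shot files labelled indoor / outdoor / crowd & machine):
    # extract_features("shot_000123.wav", "shot_000123.arff")
    # train_and_evaluate("train_locations.arff", "test_locations.arff")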

References

[1] Christian Szegedy et al.: Going Deeper with Convolutions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[2] Florian Eyben et al.: Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. ACM Multimedia, 2013.
[3] Michael Storz et al.: Rapid Model-Driven Annotation and Evaluation for Object Detection in Videos. HCI, 2015.
[4] Mark Hall et al.: The WEKA Data Mining Software: An Update. SIGKDD Explorations, 2009.
[5] Jeroen Vendrig and Marcel Worring: Systematic Evaluation of Logical Story Unit Segmentation. IEEE Transactions on Multimedia, 2002.
[6] Zeeshan Rasheed and Mubarak Shah: A Graph Theoretic Approach for Scene Detection in Produced Videos. 2003.
[7] J. MacQueen: Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[8] Thomas Sikora: The MPEG-7 Visual Standard for Content Description - An Overview. IEEE Transactions on Circuits and Systems for Video Technology, 2001.
[9] Maximilian Eibl et al.: An Extensible Tool for the Annotation of Videos Using Segmentation and Tracking. HCI, 2011.
[10] Mathias Lux et al.: LIRE: Lucene Image Retrieval - An Extensible Java CBIR Library. ACM Multimedia, 2008.
[11] K. Strimmer et al.: Feature Selection in Omics Prediction Problems Using CAT Scores and False Nondiscovery Rate Control. arXiv:0903.2003, 2009.
[12] Jonathan G. Fiscus et al.: TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking. TRECVID Workshop, 2016.
[13] Yiannis Kompatsiaris et al.: Differential Edit Distance as a Countermeasure to Video Scene Ambiguity. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2012.
[14] John Salvatier et al.: Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv, 2016.
[15] Boon-Lock Yeo et al.: Segmentation of Video by Clustering and Graph Analysis. Computer Vision and Image Understanding, 1998.
[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[17] David G. Lowe: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004.
[18] Alan F. Smeaton, Paul Over, and Wessel Kraaij: Evaluation Campaigns and TRECVid. ACM International Workshop on Multimedia Information Retrieval (MIR '06), 2006.
[19] Paul Over et al.: TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics. TRECVID Workshop, 2015.
[20] Alan Hanjalic et al.: Automated High-Level Movie Segmentation for Advanced Video-Retrieval Systems. IEEE Transactions on Circuits and Systems for Video Technology, 1999.
[21] Karen Simonyan and Andrew Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations (ICLR), 2015.
[22] Sameer Singh et al.: Indoor vs. Outdoor Scene Classification in Digital Photographs. Pattern Recognition, 2005.
[23] J. H. Ward: Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 1963.
[24] Yangqing Jia et al.: Caffe: Convolutional Architecture for Fast Feature Embedding. ACM Multimedia, 2014.