IBM Research and Columbia University TRECVID-2013 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), Surveillance Event Detection (SED), and Semantic Indexing (SIN) Systems

For this year's TRECVID Multimedia Event Detection task [8], our team studied a semantic approach to video retrieval. We constructed a faceted taxonomy of 1313 visual concepts (including attributes and dynamic action concepts) and 85 audio concepts. Event search was performed via keyword search with a human user in the loop, as sketched below. Our submitted runs covered both the Pre-Specified and Ad-Hoc event collections. For each collection, we submitted three exemplar conditions (0, 10, and 100 exemplars), and for each exemplar condition, three types of semantic-modality retrieval results: visual only, audio only, and combined.

The current IBM-Columbia MER system exploits nine observations about human cognition, language, and visual perception to produce an effective video recounting of an event. We designed and tuned algorithms that both locate a minimal persuasive video segment and script a minimal verbal collection of concepts, so as to convince an analyst that the MED decision was correct. With little loss of descriptive clarity, the system achieved the highest speed-up ratio among the ten teams competing in the NIST MER evaluation.

For SED, we explore temporal dependencies between events to enhance both evaluation tasks: automatic event detection (retrospective) and interactive event detection with a human in the loop (interactive). Our retrospective system is based on a joint segmentation-detection framework integrated with temporal event modeling, while the interactive system performs risk analysis to guide the end user toward effective verification. We achieve better results on both the retrospective and interactive tasks than last year.

For SIN, we submitted four full concept-detection runs and two concept-pair runs. In the first three concept-detection runs, we varied the data-sampling strategy among balanced bags via majority-class undersampling for ensemble fusion learning, balanced bags via minority-class oversampling, and unbalanced bags. For the fourth run, we used a rank-normalized fusion of the first three runs (sketched below). The concept-pair runs scored each pair as the sum of the individual concept classifiers, with and without sigmoid normalization of the classifier scores (also sketched below).

∗ Columbia University, Dept. of Electrical Engineering. † IBM T. J. Watson Research Center. ‡ Columbia University, Dept. of Computer Science. § Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20070. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

1 Multimedia Event Detection (MED)
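As context for this section: the abstract above describes event search as a human selecting keywords that match entries in the concept vocabulary, with videos ranked by the corresponding detector scores. The sketch below is a minimal illustration of that idea, assuming a hypothetical concept list, a random stand-in score matrix, and simple mean aggregation; none of these reflect the submitted system's actual components.

```python
import numpy as np

# Hypothetical concept vocabulary (the real taxonomy has 1313 visual and
# 85 audio concepts); scores[i, j] = detector j's confidence on video i.
concepts = ["dog", "barking", "park", "person_walking", "leash"]
rng = np.random.default_rng(0)
scores = rng.random((100, len(concepts)))  # stand-in detector outputs

def semantic_event_search(query_keywords, concepts, scores):
    """Rank videos by the mean detector score of the concepts that a
    human user picked as keywords for the event query."""
    cols = [concepts.index(k) for k in query_keywords if k in concepts]
    if not cols:
        raise ValueError("no query keyword matches the concept vocabulary")
    video_scores = scores[:, cols].mean(axis=1)
    return np.argsort(-video_scores)  # video indices, best match first

# e.g., searching for a "walking the dog" event via three matched concepts
ranking = semantic_event_search(["dog", "leash", "park"], concepts, scores)
print(ranking[:10])
```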
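Both the combined visual+audio MED runs and the fourth SIN run fuse the outputs of separate runs, and the abstract names rank-normalized fusion for the latter. The sketch below shows one common form of rank normalization (raw scores replaced by normalized ranks, then averaged); the exact normalization and weighting used in the submissions are assumptions here.

```python
import numpy as np

def rank_normalize(scores):
    """Map raw scores to (0, 1] by rank: the best score becomes 1.0 and
    the worst becomes 1/n, removing scale differences between runs."""
    ranks = np.argsort(np.argsort(-scores))  # 0 = best
    return 1.0 - ranks / len(scores)

def fuse_runs(runs):
    """Average the rank-normalized scores of several runs over the same videos."""
    return np.mean([rank_normalize(r) for r in runs], axis=0)

run_a = np.array([2.3, 0.1, 1.7, 5.0])   # e.g., visual-only run scores
run_b = np.array([0.9, 0.8, 0.95, 0.2])  # e.g., audio-only run scores
print(fuse_runs([run_a, run_b]))         # fused score per video
```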
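Finally, the concept-pair SIN runs described above score a pair as the sum of its two individual concept classifiers, with and without sigmoid normalization. A minimal sketch follows, assuming raw SVM-style decision values as input; the sigmoid's slope and bias would normally be fitted per concept (as in Platt scaling) but are fixed constants here for illustration.

```python
import numpy as np

def sigmoid(x, a=1.0, b=0.0):
    """Squash raw decision values into (0, 1). The slope a and bias b are
    illustrative constants, not values fitted to any classifier."""
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

def pair_score(scores_a, scores_b, normalize=True):
    """Score the concept pair (A, B) per video as the sum of the two
    individual classifier outputs, optionally sigmoid-normalized first."""
    if normalize:
        scores_a, scores_b = sigmoid(scores_a), sigmoid(scores_b)
    return scores_a + scores_b

svm_a = np.array([1.2, -0.4, 0.3])  # raw outputs of concept A's classifier
svm_b = np.array([-0.8, 0.9, 0.1])  # raw outputs of concept B's classifier
print(pair_score(svm_a, svm_b))         # with sigmoid normalization
print(pair_score(svm_a, svm_b, False))  # without normalization
```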

References

[1] C. Schmid et al., "Dense Trajectories and Motion Boundary Descriptors for Action Recognition," International Journal of Computer Vision, 2013.

[2] T. Serre et al., "HMDB: A Large Video Database for Human Motion Recognition," IEEE International Conference on Computer Vision (ICCV), 2011.

[3] C.-J. Lin et al., "LIBLINEAR: A Library for Large Linear Classification," Journal of Machine Learning Research, 2008.

[4] D. P. W. Ellis et al., "IBM Research and Columbia University TRECVID-2011 Multimedia Event Detection (MED) System," NIST TRECVID Workshop, 2011.

[5] S.-F. Chang et al., "Consumer Video Understanding: A Benchmark Database and an Evaluation of Human and Machine Performance," ACM International Conference on Multimedia Retrieval (ICMR), 2011.

[6] J. R. Smith et al., "IBM Research TRECVID-2009 Video Retrieval System," NIST TRECVID Workshop, 2009.

[7] F. De la Torre et al., "Joint Segmentation and Classification of Human Actions in Video," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[8] A. Smeaton et al., "TRECVID 2013 -- An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics," NIST TRECVID Workshop, 2013.

[9] T. Mensink et al., "Improving the Fisher Kernel for Large-Scale Image Classification," European Conference on Computer Vision (ECCV), 2010.

[10] D. P. W. Ellis et al., "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Transactions on Audio, Speech, and Language Processing, 2010.

[11] A. Natsev et al., "Web-Based Information Content and Its Application to Concept-Based Video Retrieval," ACM International Conference on Image and Video Retrieval (CIVR), 2008.

[12] P. Over et al., "High-Level Feature Detection from Video in TRECVid: A 5-Year Retrospective of Achievements," 2009.

[13] M. Shah et al., "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild," arXiv preprint, 2012.

[14] J. Hays et al., "SUN Attribute Database: Discovering, Annotating, and Recognizing Scene Attributes," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[15] D. Liu et al., "∝SVM for Learning with Label Proportions," International Conference on Machine Learning (ICML), 2013.

[16] A. Zisserman et al., "Efficient Additive Kernels via Explicit Feature Maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[17] A. Ali et al., "Face Recognition with Local Binary Patterns," 2012.

[18] R. Yan et al., "Large-Scale Multimedia Semantic Concept Modeling Using Robust Subspace Bagging and MapReduce," ACM Workshop on Large-Scale Multimedia Retrieval and Mining (LS-MMRM), 2009.

[19] S. Bengio et al., "Sound Retrieval and Ranking Using Sparse Auditory Representations," Neural Computation, 2010.

[20] K. Crammer et al., "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines," Journal of Machine Learning Research, 2002.