CRMActive: An Active Learning Based Approach for Effective Video Annotation and Retrieval

Conventional multimedia annotation/retrieval systems such as Normalized Continuous Relevance Model (NormCRM)[7] require a fully labeled training data for a good performance. Active Learning, by determining an order for labeling the training data, allows for a good performance even before the training data is fully annotated. In this work we propose an active learning algorithm, which combines a novel measure of sample uncertainty with a novel clustering-based approach for determining sample density and diversity and integrate it with NormCRM. The clusters are also iteratively refined to ensure both feature and label-level agreement among samples. We show that our approach outperforms multiple baselines both on a new, open dataset and on the popular TRECVID corpus at both the tasks of annotation and text-based retrieval of videos.

[1]  R. Manmatha,et al.  Statistical models for automatic video annotation and retrieval , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Louis-Philippe Morency,et al.  A Multimodal Context-based Approach for Distress Assessment , 2014, ICMI.

[3]  Shih-Fu Chang,et al.  Multimedia knowledge: discovery, classification, browsing, and retrieval , 2005 .

[4]  Klaus Brinker,et al.  Incorporating Diversity in Active Learning with Support Vector Machines , 2003, ICML.

[5]  Marc G. Genton,et al.  Classes of Kernels for Machine Learning: A Statistics Perspective , 2002, J. Mach. Learn. Res..

[6]  Louis-Philippe Morency,et al.  Computational Analysis of Persuasiveness in Social Multimedia: A Novel Dataset and Multimodal Prediction Approach , 2014, ICMI.

[7]  Li-Rong Dai,et al.  Video Annotation by Active Learning and Cluster Tuning , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[8]  Tony Jebara,et al.  Probability Product Kernels , 2004, J. Mach. Learn. Res..

[9]  Andrew W. Moore,et al.  X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[10]  Thomas S. Huang,et al.  Leveraging Active Learning for Relevance Feedback Using an Information Theoretic Diversity Measure , 2006, CIVR.

[11]  Stacy Marsella,et al.  SmartBody: behavior realization for embodied conversational agents , 2008, AAMAS.

[12]  Louis-Philippe Morency,et al.  Verbal Behaviors and Persuasiveness in Online Multimedia Content , 2014, SocialNLP@COLING.

[13]  Jan Sedmidubský,et al.  Retrieving Similar Movements in Motion Capture Data , 2013, SISAP.

[14]  Edward Y. Chang,et al.  Support vector machine active learning for image retrieval , 2001, MULTIMEDIA '01.

[15]  Wei Liu,et al.  Multimedia classification and event detection using double fusion , 2013, Multimedia Tools and Applications.

[16]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[17]  Yi Yang,et al.  Interactive Video Indexing With Statistical Active Learning , 2012, IEEE Transactions on Multimedia.

[18]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[19]  Louis-Philippe Morency,et al.  Acoustic and para-verbal indicators of persuasiveness in social multimedia , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Edward Y. Chang,et al.  Active Learning for Interactive Multimedia Retrieval , 2008, Proceedings of the IEEE.

[21]  Renato De Mori,et al.  A Cache-Based Natural Language Model for Speech Recognition , 1990, IEEE Trans. Pattern Anal. Mach. Intell..

[22]  Louis-Philippe Morency,et al.  Context-based signal descriptors of heart-rate variability for anxiety assessment , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).