Semantic Aware Video Transcription Using Random Forest Classifiers

This paper focuses on transcription generation in the form of subject, verb, object (SVO) triplets for videos in the wild, given off-the-shelf visual concept detectors. The problem is challenging because only sentence-level annotations are available, concept detectors are unreliable, and many words have few training samples. To address these challenges, we propose a Semantic Aware Transcription (SAT) framework based on Random Forest classifiers. It takes concept detection results as input and outputs a distribution over English words. SAT is trained on video–sentence pairs. It learns node splits hierarchically by grouping semantically similar words, with similarity measured by a continuous skip-gram language model. This not only mitigates the sparsity of training samples per word, but also yields semantically reasonable errors during transcription. SAT also provides a systematic way to measure how closely a concept detector relates to real words, which helps us understand the relationship between current visual detectors and words in a semantic space. Experiments on a large video dataset with 1,970 clips and 85,550 sentences demonstrate the effectiveness of our approach.
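The core idea of a semantic-aware node split can be illustrated with a minimal sketch: at each tree node, the candidate output words are partitioned into two semantically coherent groups by clustering their skip-gram embeddings, and the node then learns to route samples toward one group or the other. The function below is a hypothetical stand-in (not the paper's actual implementation) that performs such a partition with a simple 2-means over toy embedding vectors; `semantic_split` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def semantic_split(word_vecs, n_iter=20, seed=0):
    """Partition words into two semantically coherent groups by running
    a simple 2-means over their (skip-gram style) embedding vectors.
    Illustrative stand-in for a semantic-aware node-split criterion."""
    rng = np.random.default_rng(seed)
    words = list(word_vecs)
    X = np.stack([word_vecs[w] for w in words]).astype(float)
    # Initialize the two centroids from two distinct random words.
    centroids = X[rng.choice(len(X), size=2, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each word to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid as the mean of its assigned words.
        for k in range(2):
            if (labels == k).any():
                centroids[k] = X[labels == k].mean(axis=0)
    left = [w for w, l in zip(words, labels) if l == 0]
    right = [w for w, l in zip(words, labels) if l == 1]
    return left, right

# Toy skip-gram-like embeddings: motion words near each other,
# animal words near each other (2-D for illustration only).
toy_vecs = {
    "run":  np.array([1.0, 0.0]),
    "walk": np.array([0.9, 0.1]),
    "dog":  np.array([0.0, 1.0]),
    "cat":  np.array([0.1, 0.9]),
}
left, right = semantic_split(toy_vecs)
```

In a full SAT-style tree, each node would then fit a binary classifier on the concept-detection features to separate training videos whose annotated words fall in `left` from those in `right`, so that errors tend to stay within a semantically related word group.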
