Mid-level Representation for Visual Recognition

Visual recognition is one of the fundamental challenges in AI, where the goal is to understand the semantics of visual data. Employing mid-level representations, in particular, has shifted the paradigm in visual recognition. A mid-level image/video representation involves discovering and training a set of mid-level visual patterns (e.g., parts and attributes) and representing a given image/video in terms of them. The mid-level patterns can be extracted from images and videos using the motion and appearance information of visual phenomena. This thesis employs mid-level representations for two high-level visual recognition tasks, namely (i) image understanding and (ii) video understanding. In the case of image understanding, we focus on the object detection/recognition task. We investigate discovering and learning a set of mid-level patches to be used for representing the images of an object category. Specifically, we employ discriminative patches in a subcategory-aware, webly-supervised fashion. We additionally study the effect of employing subcategory-based models for undoing dataset bias.
