Towards a Visual Story Network Using Multiple Views for Object Recognition at Different Levels of Spatiotemporal Context

We present a general computational multi-class visual recognition model, which we term the Visual Story Network (VSN). The proposed model aims to generalize and integrate ideas from successful hierarchical recognition approaches in computer vision and brain science, ranging from today's deep neural networks to more classical models of visual learning and recognition from neuroscience, such as the well-known Adaptive Resonance Theory. Our recursive, graph-based model has the advantage of enabling rich interactions between classes and features at different levels of interpretation and abstraction. The Visual Story Network offers multiple views of a visual concept: the basic, bottom-up view is based on the object's current local appearance, while the higher-level view is based on the larger spatiotemporal context, such as the role played by that concept in the overall story. This story includes the spatial relations and interactions with other objects, as well as events and global information from the scene. The structure of the VSN can be constructed efficiently through step-by-step updates, during which new features or more complex classifiers are added one by one. Given a particular VSN structure, its weights can also be fully learned or fine-tuned, end-to-end, by efficient methods such as backpropagation with stochastic gradient descent. In its general form, the VSN is a graph of nonlinear classifiers or feature nodes that are automatically selected from a large pool and combined to form new nodes; each newly learned node then becomes a potential new usable feature. The feature pool can contain both manually designed features and more complex classifiers learned in previous steps, each copied many times at different scales and locations. In this manner we can learn and grow both a deep, complex graph of classifiers and a rich pool of features at different levels of abstraction and interpretation. At every stage the VSN can be fully trained, end-to-end, either in a supervised way or in a novel, naturally self-supervised way, which we discuss in detail. Our proposed graph of classifiers thus becomes a multi-class system with a recursive structure, suitable for deep detection and recognition of several classes simultaneously.
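
To make the incremental construction described above more concrete, below is a minimal Python sketch of the greedy grow-the-graph loop: at each step a candidate node is trained on a small subset of the current feature pool, and, if it helps on held-out data, its output is appended to the pool as a new usable feature. All names here (VSNNode, grow_vsn, the logistic units standing in for the nonlinear classifier nodes, the acceptance test) are illustrative assumptions rather than the actual VSN implementation; in the full model, the resulting graph would additionally be fine-tuned end-to-end with backpropagation.

```python
# Minimal sketch of the step-by-step VSN construction loop described above.
# All names and design choices here are hypothetical illustrations, not the
# authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression


class VSNNode:
    """A classifier node: a simple logistic unit over a subset of pool features."""
    def __init__(self, feature_ids, model):
        self.feature_ids = feature_ids   # indices into the current feature pool
        self.model = model               # trained classifier combining those features

    def output(self, pool_outputs):
        # pool_outputs: (n_samples, n_pool_features) responses of all pool features
        return self.model.predict_proba(pool_outputs[:, self.feature_ids])[:, 1]


def grow_vsn(pool_outputs, labels, n_rounds=3, k_features=5, rng=None):
    """Greedy growth: each round selects a small feature subset, trains a new
    classifier node on it, and, if the node helps on held-out data, appends its
    output to the pool as a new feature (giving the graph its recursive structure)."""
    if rng is None:
        rng = np.random.default_rng(0)
    nodes = []
    half = len(labels) // 2
    for _ in range(n_rounds):
        # 1. Sample a candidate feature subset from the (growing) pool.
        cand = rng.choice(pool_outputs.shape[1], size=k_features, replace=False)
        # 2. Train a candidate node on the first half of the data.
        model = LogisticRegression(max_iter=1000).fit(
            pool_outputs[:half, cand], labels[:half])
        node = VSNNode(cand, model)
        # 3. Keep the node only if it beats chance on the held-out half.
        score = model.score(pool_outputs[half:, cand], labels[half:])
        if score > 0.5:
            nodes.append(node)
            # 4. The new node's output becomes a new usable feature in the pool.
            new_feat = node.output(pool_outputs).reshape(-1, 1)
            pool_outputs = np.hstack([pool_outputs, new_feat])
    return nodes, pool_outputs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))            # stand-in for initial pool feature responses
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary labels
    nodes, pool = grow_vsn(X, y, rng=rng)
    print(f"grew {len(nodes)} nodes; pool now has {pool.shape[1]} features")
```

In this sketch, the acceptance test and the logistic units are placeholders for the more general nonlinear classifiers and selection criteria of the VSN; the key point illustrated is that every accepted node feeds back into the pool, so later nodes can build on earlier ones at higher levels of abstraction.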
