Combining multiple visual processing streams for locating and classifying objects in video

Automated, invariant object detection has proven to be a substantial challenge for the artificial intelligence research community. In computer vision, many benchmarks have been established using whole-image classification on datasets that are too small to eliminate statistical artifacts. As an alternative, we used a new dataset consisting of ~62 GB (on the order of 40,000 2-megapixel frames) of compressed high-definition aerial video, which we employed for both object classification and localization. Our algorithms mimic the processing pathways in primate visual cortex, exploiting color/texture, shape/form, and motion. We then combine the data using a clustering technique to produce a final output in the form of labeled bounding boxes around objects of interest in the video. Localization adds complexity not generally found in whole-image classification problems. Our results are evaluated qualitatively and quantitatively using a scoring metric that assesses the overlap between our detections and ground truth.

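Two steps in the abstract lend themselves to a concrete illustration: fusing per-stream detections with a clustering technique, and scoring localizations by their overlap with ground truth. Since neither the clustering algorithm nor the metric is specified above, the Python sketch below assumes DBSCAN-style density clustering of bounding-box centers and an intersection-over-union (IoU) overlap score; the function names, box format, and parameter values are illustrative, not the authors' implementation.

```python
# Minimal sketch of multi-stream detection fusion and overlap scoring.
# Assumptions: boxes are (x1, y1, x2, y2) in pixels; fusion uses DBSCAN on box
# centers; scoring uses IoU. All thresholds are placeholder values.
import numpy as np
from sklearn.cluster import DBSCAN

def fuse_detections(boxes, eps=30.0, min_samples=2):
    """Cluster per-stream boxes by center proximity and return one averaged
    box per cluster; unclustered (noise) boxes are passed through unchanged."""
    boxes = np.asarray(boxes, dtype=float)
    centers = np.column_stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    fused = []
    for lbl in set(labels):
        members = boxes[labels == lbl]
        if lbl == -1:                       # noise points: keep each box as-is
            fused.extend(members.tolist())
        else:                               # cluster: average member boxes
            fused.append(members.mean(axis=0).tolist())
    return fused

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

# Example: three streams report overlapping boxes for one object, plus one
# spurious detection elsewhere in the frame.
streams = [[100, 100, 150, 140], [104, 98, 152, 143], [98, 102, 149, 141],
           [400, 300, 430, 330]]
ground_truth = [101, 100, 151, 141]
for box in fuse_detections(streams):
    print(box, "IoU vs. ground truth:", round(iou(box, ground_truth), 3))
```

Averaging member boxes is one simple fusion rule; a system with per-stream confidence scores might instead weight the boxes or vote on class labels within each cluster.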