Combining multiple visual processing streams for locating and classifying objects in video

Automated, invariant object detection has proven to be a substantial challenge for the artificial intelligence research community. In computer vision, many benchmarks have been established using whole-image classification on datasets that are too small to eliminate statistical artifacts. As an alternative, we used a new dataset consisting of ~62 GB (on the order of 40,000 2-megapixel frames) of compressed high-definition aerial video, which we employed for both object classification and localization. Our algorithms mimic the processing pathways in primate visual cortex, exploiting color/texture, shape/form, and motion. We then combine the data using a clustering technique to produce a final output in the form of labeled bounding boxes around objects of interest in the video. Localization adds complexity not generally found in whole-image classification problems. Our results are evaluated qualitatively and quantitatively using a scoring metric that assesses the overlap between our detections and ground truth.

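Two steps in the abstract lend themselves to a concrete illustration: fusing per-stream detections with a clustering technique, and scoring localizations by their overlap with ground truth. Since neither the clustering algorithm nor the metric is specified above, the Python sketch below assumes DBSCAN-style density clustering of bounding-box centers and an intersection-over-union (IoU) overlap score; the function names, box format, and parameter values are illustrative, not the authors' implementation.

```python
# Minimal sketch of multi-stream detection fusion and overlap scoring.
# Assumptions: boxes are (x1, y1, x2, y2) in pixels; fusion uses DBSCAN on box
# centers; scoring uses IoU. All thresholds are placeholder values.
import numpy as np
from sklearn.cluster import DBSCAN

def fuse_detections(boxes, eps=30.0, min_samples=2):
    """Cluster per-stream boxes by center proximity and return one averaged
    box per cluster; unclustered (noise) boxes are passed through unchanged."""
    boxes = np.asarray(boxes, dtype=float)
    centers = np.column_stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    fused = []
    for lbl in set(labels):
        members = boxes[labels == lbl]
        if lbl == -1:                       # noise points: keep each box as-is
            fused.extend(members.tolist())
        else:                               # cluster: average member boxes
            fused.append(members.mean(axis=0).tolist())
    return fused

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

# Example: three streams report overlapping boxes for one object, plus one
# spurious detection elsewhere in the frame.
streams = [[100, 100, 150, 140], [104, 98, 152, 143], [98, 102, 149, 141],
           [400, 300, 430, 330]]
ground_truth = [101, 100, 151, 141]
for box in fuse_detections(streams):
    print(box, "IoU vs. ground truth:", round(iou(box, ground_truth), 3))
```

Averaging member boxes is one simple fusion rule; a system with per-stream confidence scores might instead weight the boxes or vote on class labels within each cluster.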