Visual Features and Their Own Optical Flow

Symmetries, invariances, and conservation laws have always been an invaluable guide in science for modeling natural phenomena through simple yet effective relations. For instance, in computer vision, translation equivariance is typically a built-in property of the neural architectures used to solve visual tasks; networks whose computational layers implement this property are known as Convolutional Neural Networks (CNNs). This kind of mathematical symmetry, like many others studied recently, is typically generated by an underlying group of transformations (translations in the case of CNNs, rotations, etc.) and is particularly suitable for processing highly structured data, such as molecules or chemical compounds, which are known to possess those specific symmetries. When dealing with video streams, however, common built-in equivariances can handle only a small fraction of the broad spectrum of transformations encoded in the visual stimulus; the corresponding neural architectures must therefore resort to a huge amount of supervision to achieve good generalization. In this paper we formulate a theory of the development of visual features based on the idea that motion itself provides the trajectories on which to impose consistency. We introduce the principle of Material Point Invariance, which states that each visual feature is invariant along its associated optical flow, so that features and their corresponding velocities form an indissoluble pair. We then discuss the interaction of features and velocities and show that certain motion invariance traits can be regarded as a generalization of the classical concept of affordance. These analyses of feature-velocity interactions and their invariance properties lead to a visual field theory that expresses the dynamical constraints of motion coherence and might reveal the joint evolution of visual features together with their associated optical flows.
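A concrete special case may help fix ideas. Taking the feature field phi to be image brightness, the invariance stated above reduces to the classical brightness-constancy equation of optical flow (Horn and Schunck), \partial\phi/\partial t + v \cdot \nabla\phi = 0; that is, the material derivative of phi along the flow vanishes. The minimal NumPy sketch below (our illustration of this prototypical case, not code from the paper; the function name transport_residual is our own) computes the residual of this transport equation for a pair of frames and a dense flow field:

    import numpy as np

    def transport_residual(phi_prev, phi_next, flow, dt=1.0):
        """Residual of the transport equation phi_t + v . grad(phi).

        Under the Material Point Invariance principle this residual should
        vanish when `flow` is the optical flow associated with the feature
        field `phi` (here, brightness constancy as the prototypical case).
        """
        # Temporal derivative of the feature field between two frames.
        phi_t = (phi_next - phi_prev) / dt
        # Spatial gradient; np.gradient returns derivatives along (rows, cols).
        gy, gx = np.gradient(phi_prev)
        # Per-pixel velocity components of the dense flow field (H, W, 2).
        vx, vy = flow[..., 0], flow[..., 1]
        # Material derivative: zero wherever the invariance holds.
        return phi_t + vx * gx + vy * gy

A dense flow estimate for such a check could be obtained, for instance, with OpenCV's calcOpticalFlowFarneback; the magnitude of the returned residual then measures how far a given feature-velocity pair is from satisfying the invariance.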
