Dynamic Equilibrium Module for Action Recognition

Temporal variations, such as sudden motion, acceleration, and occlusion, occur frequently in real-world videos and force video-modeling networks to account for them. However, such variations are often not beneficial for recognizing actions at coarse granularity and may thus impede spatio-temporal learning. Prior solutions to this problem usually introduce multiple network branches that process input frames at different sampling rates, or employ dedicated components to model inter-frame relations, both of which are computationally expensive. In this paper, we propose a simple and flexible Dynamic Equilibrium Module (DEM) that performs video modeling through adaptive Eulerian motion manipulation. The proposed module can be directly inserted into 3D and (2+1)D backbone networks to reduce the impact of temporal variations and learn more robust spatio-temporal representations. We demonstrate performance gains from inserting DEM into R3D and R(2+1)D models on the Kinetics-400, UCF-101, and HMDB-51 datasets.
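To make the plug-in idea concrete, below is a minimal sketch of how a lightweight temporal module could be inserted after a residual stage of a torchvision R3D backbone. The class `DynamicEquilibriumStub`, its gating design, and the helper `insert_after_stage` are illustrative assumptions, not the paper's actual DEM or its Eulerian formulation; the sketch only shows the insertion pattern the abstract describes.

```python
# Sketch: inserting a plug-in temporal module into an R3D backbone.
# `DynamicEquilibriumStub` is a hypothetical placeholder, not the paper's DEM.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class DynamicEquilibriumStub(nn.Module):
    """Placeholder block: per-frame gating of a 5D feature map (N, C, T, H, W).
    The gating design is an assumption for illustration only."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d((None, 1, 1)),            # pool space, keep time
            nn.Conv3d(channels, channels, kernel_size=1),  # per-frame channel weights
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)


def insert_after_stage(backbone: nn.Module, stage_name: str, channels: int) -> nn.Module:
    """Wrap one residual stage so the stub runs right after it."""
    stage = getattr(backbone, stage_name)
    setattr(backbone, stage_name, nn.Sequential(stage, DynamicEquilibriumStub(channels)))
    return backbone


if __name__ == "__main__":
    model = insert_after_stage(r3d_18(), "layer3", channels=256)
    clip = torch.randn(2, 3, 16, 112, 112)   # (N, C, T, H, W)
    print(model(clip).shape)                  # torch.Size([2, 400])
```

Because the stub only wraps an existing stage, the backbone's forward pass and pretrained weights are left untouched; the same pattern would apply to an R(2+1)D model.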