Mutual information guided 3D ResNet for self-supervised video representation learning

In this work, the authors propose a novel self-supervised learning method based on mutual information that learns representations from videos without manual annotation. Different clips sampled from the same video are usually temporally coherent. To guide the network to learn this temporal coherence, they maximise the mutual information between global features extracted from different clips of the same video (Global-MI). However, maximising the Global-MI alone drives the network to seek content shared across clips, so it may degenerate into focusing on the video background. To account for the structure of the video, they additionally maximise the average mutual information between the global feature and local patches from multiple regions of the video clip (multi-region Local-MI). Their approach, called Max-GL, learns temporal coherence by jointly maximising the Global-MI and the multi-region Local-MI. Experimental results show that Max-GL can serve as an effective pre-training method for action recognition in videos. Additional experiments on action similarity labelling and dynamic scene recognition also support the generalisation of the representations learned by Max-GL.
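The joint objective described above can be illustrated with a minimal sketch. The abstract does not say which mutual information estimator is used, so this example assumes the InfoNCE lower bound with a dot-product critic; the function names, the `weight` parameter, and the layout of the feature lists are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives):
    """InfoNCE lower bound on mutual information, in nats.

    anchors[i] and positives[i] are assumed to come from the same video;
    every other pairing in the batch is treated as a negative.
    The bound can never exceed log(batch size).
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        scores = [dot(anchors[i], positives[j]) for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[i] - log_z
    return math.log(n) + total / n


def max_gl_objective(global_a, global_b, local_regions, weight=1.0):
    """Hypothetical joint objective: Global-MI plus averaged Local-MI.

    global_a, global_b: global features of two clips from each video.
    local_regions[r][i]: local patch feature of region r for video i.
    weight: assumed trade-off factor (not specified in the abstract).
    """
    global_mi = info_nce(global_a, global_b)
    # Average the global-to-local bound over all regions.
    local_mi = sum(info_nce(global_a, region) for region in local_regions)
    local_mi /= len(local_regions)
    return global_mi + weight * local_mi
```

In training, the negative of this quantity would be minimised with respect to the encoder parameters. The local term is what discourages the degenerate solution mentioned above: features that only encode a shared background are less able to distinguish region-level patches of one video from those of another.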
