Learning Temporal Dynamics in Videos With Image Transformer