Joint learning of images and videos with a single Vision Transformer