ConViT: improving vision transformers with soft convolutional inductive biases