Siamese Vision Transformers are Scalable Audio-visual Learners