AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR