Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language