Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers