Cross-modal supervision for learning active speaker detection in video