Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association