FOR END-TO-END AUDIO-VISUAL SPEECH RECOGNITION