Influence of Phone-Viseme Temporal Correlations on Audiovisual STT and TTS Performance

In this paper, we present a study of temporal correlations between audiovisual units in continuous Russian speech. The corpus-based study identifies natural time asynchronies between the audible and visible speech modalities, caused in part by the inertia of the articulatory organs. Original methods for modeling speech asynchrony are proposed and studied using bimodal ASR and TTS systems. The experimental results show that asynchronous frameworks for combined processing of audible and visible speech improve the accuracy of audiovisual speech recognition as well as the naturalness and intelligibility of speech synthesis.