Phone-Informed Refinement of Synthesized Mel Spectrogram for Data Augmentation in Speech Recognition

While recent end-to-end automatic speech recognition (ASR) models achieve high performance, they require abundant training data, which is a barrier to applying them to a specific domain. To mitigate the lack of training data, text-to-speech (TTS) systems have been used to leverage text-only data and efficiently generate paired data for training the ASR model. The widely used procedure first generates a Mel spectrogram from the text, then converts it into a waveform, and finally converts the waveform back into a Mel spectrogram. A vocoder is often used in this conversion to alleviate the difference between real and synthesized speech, but it requires a huge amount of run-time. In this work, we propose a phone-informed post-processing network that refines Mel spectrograms without using a vocoder. The proposed network takes not only Mel spectrograms but also text information as input, so that phone sequence information can be exploited for refinement. Experimental evaluations demonstrate that the proposed network achieves better word error rates (WERs) than the vocoder-based approach on an English domain adaptation task (LibriSpeech to TED-LIUM 2; read speech to spontaneous speech) with much less data generation time. It is also shown that the use of phone information is critical to this improvement. We further confirm the effectiveness of the proposed model on a Japanese domain adaptation task (CSJ-SPS to CSJ-APS; everyday topics to academic topics).
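The two data-generation paths contrasted above can be sketched as follows. This is a minimal illustration of the data flow only: every function here is a hypothetical stub (not from the paper or any real library), and the Mel dimensionality and hop size are assumed values, chosen just to make the tensor shapes concrete.

```python
import numpy as np

N_MELS = 80   # Mel filterbank channels (assumed)
HOP = 256     # vocoder hop size in samples (assumed)

def tts_acoustic_model(phones):
    """Stub: phone sequence -> Mel spectrogram (frames x N_MELS)."""
    return np.zeros((len(phones) * 5, N_MELS))

def vocoder(mel):
    """Stub: Mel spectrogram -> waveform; the slow step in the
    conventional pipeline."""
    return np.zeros(mel.shape[0] * HOP)

def extract_mel(wave):
    """Stub: waveform -> Mel spectrogram used for ASR training."""
    return np.zeros((len(wave) // HOP, N_MELS))

def phone_informed_refiner(mel, phones):
    """Stub: the proposed refinement, conditioned on the phone
    sequence, which skips the waveform round-trip entirely."""
    return mel.copy()  # same shape; refinement happens in Mel space

phones = ["HH", "AH", "L", "OW"]
mel = tts_acoustic_model(phones)

# Conventional pipeline: Mel -> waveform -> Mel (vocoder bottleneck).
mel_for_asr = extract_mel(vocoder(mel))

# Proposed: refine the synthesized Mel directly, using phone info.
mel_refined = phone_informed_refiner(mel, phones)
assert mel_refined.shape == mel.shape
```

The point of the sketch is that the proposed refiner replaces two conversions (vocoder plus feature extraction) with a single Mel-to-Mel network, which is where the reported savings in data generation time come from.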