An evaluation of Mongolian data-driven Text-to-Speech

This paper presents a first attempt to evaluate data-driven speech synthesis of Mongolian, trained on a 1500-sentence female speech corpus. The corpus contains nearly 6 hours of Mongolian female speech and is designed to cover all Mongolian phones. The evaluation is carried out on two levels. For overall quality, we generated 25 sentences and asked raters to judge them on a Mean Opinion Score (MOS) scale. The second evaluation is a phoneme confusion test that covers the full Mongolian phoneme set.
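
As a rough illustration of how the two evaluation levels could be scored, the following minimal Python sketch averages listener ratings into a MOS and tallies a phoneme confusion table. It is not the scoring procedure used in this work: the 1 to 5 rating scale, the function names, and the sample ratings and phoneme labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings (assumed here to be on a 1-5 scale)."""
    if not ratings:
        raise ValueError("no ratings given")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in 1..5")
    return mean(ratings)


def confusion_counts(responses):
    """Tally (intended, perceived) phoneme pairs into a nested count table."""
    table = defaultdict(Counter)
    for intended, perceived in responses:
        table[intended][perceived] += 1
    return table


def phoneme_accuracy(table):
    """Fraction of responses where the perceived phoneme matched the intended one."""
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(row[phoneme] for phoneme, row in table.items())
    return correct / total


# Hypothetical data: five ratings for one sentence, four listener responses.
# The labels "a", "o", "n" are placeholders, not the Mongolian phoneme inventory.
example_ratings = [4, 5, 3, 4, 4]
example_responses = [("a", "a"), ("a", "o"), ("o", "o"), ("n", "n")]

mos = mean_opinion_score(example_ratings)
table = confusion_counts(example_responses)
accuracy = phoneme_accuracy(table)
```

In this sketch the off-diagonal entries of the table, such as `table["a"]["o"]`, record which phonemes listeners confused with one another, while the accuracy gives a single summary figure.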