Tunisian Dialectal End-to-end Speech Recognition based on DeepSpeech

Abstract Automatically recognizing spontaneous human speech and transcribing it into text is becoming an important task. However, freely available models are rare, especially for under-resourced languages and dialects, since they require large amounts of data to achieve high performance. This paper describes an approach to building an end-to-end Tunisian dialect speech recognition system based on deep learning. For this purpose, a Tunisian dialect paired text-speech dataset called "TunSpeech" was created. Existing Modern Standard Arabic (MSA) speech data was also combined with dialectal Tunisian data, which decreased the Out-Of-Vocabulary rate and improved perplexity. On the other hand, synthetic dialectal data from a text-to-speech system increased the Word Error Rate.
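
For readers unfamiliar with two of the metrics named above, the following is a minimal illustrative sketch of how the Word Error Rate and the Out-Of-Vocabulary rate are commonly computed. It assumes whitespace tokenization and the standard edit-distance definition of WER; the function names are hypothetical and this is not the evaluation code used in this work.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Assumption: both strings are tokenized on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def oov_rate(vocabulary, test_text):
    """Fraction of test tokens that do not appear in the training vocabulary.

    Assumption: `vocabulary` is a set of words and `test_text` is
    tokenized on whitespace.
    """
    tokens = test_text.split()
    if not tokens:
        raise ValueError("test_text must contain at least one word")
    unseen = sum(1 for t in tokens if t not in vocabulary)
    return unseen / len(tokens)
```

Under these definitions, a lower WER indicates a better transcription, and a lower OOV rate indicates that the training vocabulary covers more of the test text, which is the effect reported above when MSA data is added.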