Speech Corpora of Under Resourced Languages of North-East India

In this paper, we present an account of an ongoing effort to create speech corpora for under-resourced languages of North-East India, namely Assamese, Bengali, and Nepali. The corpora are being created to develop an automatic speech recognition system for Assamese as well as a language identification system. The Assamese text corpus comprises 1000 sentences collected from sources such as story books, novels, and proverbs. Speech data are recorded over a telephone channel using an interactive voice response system. Speakers were asked to read one or more sets of sentences, each set containing 20 sentences. Speech was simultaneously recorded using a hand-held audio recorder. While a significant amount of speech data has been collected for Assamese, collection has begun for Bengali, Nepali, and English as spoken by native speakers of these three languages. Currently, the Assamese speech database contains more than 5000 utterances by 27 native speakers. Information about the speakers, such as dialect, gender, and age group, was also collected. We discuss the methodology used in collecting speech samples and present descriptive statistics of the speech corpora.
