The NCHLT speech corpus of the South African languages

The NCHLT speech corpus contains wide-band speech from approximately 200 speakers per language, in each of the eleven official languages of South Africa. We describe the design and development of the corpus, and report on the associated materials released with it, such as orthographic transcriptions and pronunciation dictionaries. To benchmark speech-recognition performance on the corpus, we have also developed phone-recognition and word-recognition systems for all eleven languages; we find that high accuracies can be achieved on these speaker-independent but vocabulary-dependent recognition tasks in all languages.
