A smartphone-based ASR data collection tool for under-resourced languages

Acoustic data collection for automatic speech recognition (ASR) is particularly challenging for under-resourced languages, many of which are spoken in the developing world. We give a brief overview of related data collection strategies, highlighting the salient issues in collecting ASR data for such languages. We then describe the development of Woefzela, a smartphone-based data collection tool designed to function in a developing-world context. Specifically, the tool works without any Internet connectivity, remains portable, and allows multiple sessions to be collected in parallel; it also simplifies data collection by providing process support to the various role players involved, and performs on-device quality control to maximise the use of recording opportunities. We demonstrate the tool in a South African data collection project, during which almost 800 hours of ASR data were collected, often in remote, rural areas, and subsequently used to successfully build acoustic models for eleven languages. We discuss the on-device quality control mechanism (referred to as QC-on-the-go) in more detail, experiment with different uses of quality control information, and evaluate their impact on ASR accuracy. Woefzela was developed for the Android operating system and is freely available for use on Android smartphones.
