Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums

We present a novel technique for rapid collection of spontaneous speech data over the mobile phone channel using telephonic community forums. Our public forum allows users to post audio messages, listen to messages posted by others, post votes and audio comments, and share content with friends through subsidized phone calls. The entertainment aspects and sharing features of the forum led to its viral spread in Pakistan. Within 8 months, it reached 11,017 users and gathered 1,207 hours of speech data comprising 57,454 audio posts and 130,685 audio comments, spanning Urdu and 9 regional languages. We trained an ASR system using just 9.5 hours of the corpus and obtained a word error rate (WER) of 24.19%. Community forums automatically overcome common challenges of spontaneous speech data collection, such as speaker recruitment, natural speech elicitation, content diversity, informed consent, sampling of real-world ambient noise, and reach (for geographically remote linguistic communities). This technique is especially useful for gathering speech corpora for under-resourced languages, enabling the development of speech recognition, keyword spotting, speaker ID, and noise classification systems (among others) for such languages. It also allows rapid, automatic preservation of spoken languages and oral aspects of culture, and can be extended to collect speech data for endangered languages, oral cultures, and linguistic minorities.
