On web-based creation of speech resources for less-resourced languages

Web-based creation of speech resources is a new paradigm for producing spoken language resources. It is particularly suited to less-resourced languages, i.e. languages for which no readily available speech resources exist. This paper maps the speech resource creation tasks to the client-server architecture of the WWW. It presents two tools that have been developed for web-based speech resource creation and demonstrates the effectiveness of this approach with three use cases: 1) high-bandwidth recordings of new speaker populations in geographically distributed locations, 2) recordings in adverse recording environments, e.g. hospitals, and 3) field recordings of endangered languages. The only infrastructure requirements are electricity for the equipment and an Internet connection.
