Transcriber: Development and use of a tool for assisting speech corpora production

Abstract. We present “Transcriber”, a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long-duration broadcast news recordings, including the annotation of speech turns, topics and acoustic conditions. It is highly portable: it relies on the scripting language Tcl/Tk, with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and on Windows. The data format follows the XML standard, with Unicode support for multilingual transcriptions. Transcriber is distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, and it has been in use for over a year in several countries. This collective experience has produced new requirements: support for additional data formats, video control, and better management of conversational speech. The recently formalized annotation graphs framework should make it easier to adapt the tool to new tasks and to support different data formats.
