Shared Linguistic Resources for the Meeting Domain

This paper describes efforts by the University of Pennsylvania's Linguistic Data Consortium (LDC) to create and distribute shared linguistic resources, including data, annotations, tools and infrastructure, to support the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation. In addition to making large volumes of training data available to research participants, LDC produced reference transcripts for the NIST Phase II Corpus and the RT-07 conference room evaluation set, which represent a variety of subjects, scenarios and recording conditions. For the 18-hour NIST Phase II Corpus, LDC created quick transcripts, which include automatic segmentation and minimal markup. The 3-hour evaluation corpus required careful verbatim reference transcripts, including manual segmentation and rich markup. The 2007 effort marked the second year of using the XTrans annotation toolkit in the meeting domain. We describe the process of creating transcripts for the RT-07 evaluation and the advantages of using XTrans in each phase of transcription, including its positive impact on quality control and real-time transcription rates. We also describe the structure and results of a pilot consistency study conducted on the 3-hour test set. Finally, we present plans for further improvements to infrastructure and transcription methods.