Introduction to the Special Section on Rich Transcription

The term Rich Transcription spans multiple areas in audio processing, and its study marks a broadening of the concerns of automatic speech recognition (ASR) to cover the affiliated areas necessary for maximally useful applications. Whereas classical speech recognition focuses purely on converting an audio signal into a sequence of textual words (without regard for capitalization, punctuation, speaker identity, pragmatic intent, and other high-level information), rich transcription attempts to produce a more highly annotated and informative output. The study of rich transcription received a great impetus in 2002, when the Defense Advanced Research Projects Agency (DARPA) started the Effective Affordable Reusable Speech-to-Text (EARS) program. This program extended the previous HUB-4 and HUB-5 programs by adding an emphasis on metadata extraction in addition to traditional word recognition. The particular metadata tasks that were studied (http://nist.gov/speech/tests/rt/rt2004/fall/docs/rt04f-eval-planv14.doc) are as follows.

• Speaker diarization: the problem of segmenting speech into regions where only one person is talking, and then linking together speech (possibly from disjoint regions of time) from the same speaker.

• Identification of sentence-like units (SUs): the task of segmenting speech into units expressing separate thoughts or ideas, similar to sentences in written language, but taking into account that spoken language might not exhibit complete grammatical sentences.

• Disfluency detection: the dual problems of detecting the locations where a fluent word stream is interrupted (interruption-point detection) and identifying the words that must be removed in order to recover the fluent word sequence of the intended utterance. This involves labeling pause fillers (e.g., "uh"), edit words (e.g., "I mean"), and the words that the speaker meant to replace in a self-repair.
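To make the three annotation layers concrete, the following is a minimal sketch (not drawn from the EARS evaluation specification; the token representation and helper functions are hypothetical) of how a rich-transcription output might carry speaker, SU, and disfluency annotations alongside the word stream, and how the disfluency labels allow recovery of the intended fluent utterance.

```python
# Hypothetical illustration: annotated word tokens carrying the three
# metadata layers described above (speaker diarization, SU boundaries,
# disfluency labels), plus helpers that use those labels.
from dataclasses import dataclass

@dataclass
class Token:
    word: str
    speaker: str            # diarization: who is talking
    su_final: bool = False  # SU detection: token ends a sentence-like unit
    disfluent: bool = False # disfluency detection: filler/edit/reparandum word

def fluent_words(tokens):
    """Drop words marked disfluent, yielding the intended word sequence."""
    return [t.word for t in tokens if not t.disfluent]

def sentence_units(tokens):
    """Group the fluent word stream into sentence-like units (SUs)."""
    units, current = [], []
    for t in tokens:
        if not t.disfluent:
            current.append(t.word)
        if t.su_final and current:
            units.append(" ".join(current))
            current = []
    if current:
        units.append(" ".join(current))
    return units

# Raw speech: "uh we leave I mean we leave Tuesday"
# Intended utterance after removing the disfluent region: "we leave Tuesday"
transcript = [
    Token("uh", "spk1", disfluent=True),     # pause filler
    Token("we", "spk1", disfluent=True),     # reparandum: words to be replaced
    Token("leave", "spk1", disfluent=True),
    Token("I", "spk1", disfluent=True),      # edit words
    Token("mean", "spk1", disfluent=True),
    Token("we", "spk1"),                     # self-repair
    Token("leave", "spk1"),
    Token("Tuesday", "spk1", su_final=True),
]

print(fluent_words(transcript))    # ['we', 'leave', 'Tuesday']
print(sentence_units(transcript))  # ['we leave Tuesday']
```

In practice each annotation layer is produced by a separate detector and scored separately, but a joint token-level representation like this is one simple way to think about how the layers combine into a single rich transcript.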
Clearly, other forms and definitions of metadata are possible, and the above tasks are offered only for illustrative purposes. While rich transcription adds a new emphasis on various forms of metadata annotation, it also maintains a strong focus on improving automatic speech recognition from a core word-error-rate point of view. This is reflected in the composition of the special issue, with about half the papers addressing ASR. Here, there is a great deal of current interest in topics such as discriminative training, the use of large amounts of training data, unsupervised and semisupervised training, and