In very large and diverse scientific projects, groups as different as linguists and engineers work with different intentions on the same signal data or its orthographic transcript, each annotating valuable new information, so building a homogeneous corpus is not easy. We describe how this can be achieved even though some of these annotations have not been updated properly or are based on erroneous or deliberately changed versions of the basis transcription. We used an algorithm similar to dynamic programming to detect differences between the transcription on which an annotation depends and the reference transcription for the whole corpus. These differences are automatically mapped onto a set of repair operations for the transcriptions, such as splitting compound words and merging neighbouring words. The annotation is then corrected on the basis of these operations. Whether a correction can be carried out automatically or has to be made manually depends on the type of the annotation as well as on the position and nature of the difference. Finally, we present an investigation in which we exploit the multi-tier annotations of the Verbmobil corpus to find out how breathing is correlated with prosodic-syntactic boundaries and dialog acts.
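The difference-detection and repair-mapping step can be illustrated with a minimal sketch. It is not the algorithm used in this work: Python's standard `difflib.SequenceMatcher` stands in for the alignment step, and the function name `repair_operations`, the operation labels and the example utterance are all hypothetical. The sketch aligns the word sequence an annotation was built on with the reference word sequence, then labels each difference as a split, a merge, or a plain replace, insert or delete.

```python
from difflib import SequenceMatcher


def repair_operations(annotated, reference):
    """Align two word lists and label each difference as a repair operation.

    Hypothetical helper: `annotated` is the transcription an annotation
    depends on, `reference` is the corpus-wide reference transcription.
    Returns tuples (operation, index_in_annotated, old_words, new_words).
    """
    ops = []
    # SequenceMatcher is a stand-in for the alignment algorithm (assumption).
    matcher = SequenceMatcher(a=annotated, b=reference, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = annotated[i1:i2], reference[j1:j2]
        if (tag == "replace" and len(old) == 1 and len(new) > 1
                and "".join(new).lower() == old[0].lower()):
            kind = "split"    # one word became several, e.g. a compound
        elif (tag == "replace" and len(old) > 1 and len(new) == 1
                and "".join(old).lower() == new[0].lower()):
            kind = "merge"    # neighbouring words became one word
        else:
            kind = tag        # plain "replace", "insert" or "delete"
        ops.append((kind, i1, old, new))
    return ops


# Invented example utterance, not taken from the corpus.
annotated = ["wir", "treffen", "uns", "am", "Montagmorgen", "in", "Han", "nover"]
reference = ["wir", "treffen", "uns", "am", "Montag", "morgen", "in", "Hannover"]
for op in repair_operations(annotated, reference):
    print(op)
```

In this sketch a split or merge leaves the character material unchanged, so an annotation attached to the affected words could plausibly be redistributed automatically. A plain replace, insert or delete carries no such guarantee and is the kind of case that may need manual checking, depending on the annotation type.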