Centering Theory in Spanish: Coding Manual

0. Introduction

This is a manual for coding Centering Theory (Grosz et al., 1995) in Spanish. The manual is still under revision. The coding is being done on two corpora:

• ISL corpus. A set of task-oriented dialogues in which participants try to find a date on which they can meet. Distributed by the Interactive Systems Lab at Carnegie Mellon University. Transcription conventions for this corpus can be found in Appendix A.
• CallHome corpus. Spontaneous telephone conversations, distributed by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. Information about this corpus can be obtained from the LDC.

This manual provides guidelines for how to segment discourse (Section 1), what to include in the list of forward-looking centers (Section 2), and how to rank the list (Section 3). In Section 4, we list some unresolved issues.

1. Utterance segmentation

1.1 Utterance

In this section, we discuss how to segment discourse into utterances. Besides the general segmentation of coordinated and subordinated clauses, we discuss how to treat some spoken-language phenomena, such as false starts.

In general, an utterance U is a tensed clause. Because we are analyzing telephone conversations, a turn may or may not be a clause. When a turn is not a clause, it is still considered an utterance if it contains entities.

The first pass in segmentation is to break the speech into intonation units. For the ISL corpus, an utterance U is defined as an intonation unit marked by {period}, {quest}, or {seos} (see Appendix A for details on transcription). Note that {comma} does not define an utterance unless it is followed by {seos}.

In the example below, (1c.) corresponds to the beginning of a turn by a different speaker. Although (1c.) is not a tensed clause, it is treated as an utterance because it contains entities, it is followed by {comma} {seos}, and it does not seem to belong to the following utterance.
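As a side note, the marker-based first pass for the ISL corpus stated above can be sketched in code. This is only a minimal illustration, not part of the coding procedure itself. It assumes that the markers appear in the transcript as whitespace-separated tokens; the function name segment_utterances and the sample transcript are hypothetical and are not taken from the corpus. The sketch does not check whether a segment is a tensed clause or contains entities, which remains a manual decision.

```python
# Markers that close an intonation unit and therefore define an utterance.
BOUNDARY_MARKERS = {"{period}", "{quest}", "{seos}"}
# {comma} alone never closes an utterance; "{comma} {seos}" does, because
# the {seos} that follows it is itself a boundary marker.
NON_BOUNDARY_MARKERS = {"{comma}"}


def segment_utterances(transcript):
    """Split a marker-annotated transcript into candidate utterances.

    Assumption: markers are whitespace-separated tokens. Any other token,
    including markers not listed above, is kept as an ordinary word.
    """
    utterances = []
    current = []
    for token in transcript.split():
        if token in BOUNDARY_MARKERS:
            # Consecutive boundary markers (e.g. "{period} {seos}") close
            # the same utterance, so only flush when words were collected.
            if current:
                utterances.append(" ".join(current))
                current = []
        elif token in NON_BOUNDARY_MARKERS:
            continue
        else:
            current.append(token)
    # Assumption: trailing words with no final marker still form a unit.
    if current:
        utterances.append(" ".join(current))
    return utterances


# Invented sample, for illustration only.
sample = (
    "nos vemos el martes {quest} {seos} "
    "el martes {comma} {seos} "
    "no puedo {comma} tengo clase {period} {seos}"
)
for utterance in segment_utterances(sample):
    print(utterance)
```

In this invented sample, "el martes" is split off as its own unit because its {comma} is followed by {seos}, while the {comma} inside "no puedo {comma} tengo clase" does not create a boundary.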