Parallel Syntactic Annotation in CReST

In this paper, we introduce the syntactic annotation of the CReST corpus, a corpus of natural language dialogues obtained from humans performing a cooperative, remote search task. The corpus contains the speech signals as well as transcriptions of the dialogues, which are additionally annotated for dialogue structure, disfluencies, and syntax. The syntactic annotation comprises POS annotation, Penn Treebank-style constituent annotation, dependency annotation, and combinatory categorial grammar annotation. The corpus is the first of its kind, providing parallel syntactic annotation based on three different grammar formalisms for a dialogue corpus. All three annotations are manually corrected, thus providing a high-quality resource for linguistic comparisons as well as for parser evaluation across frameworks.
