Linguistically Annotated Corpus as an Invaluable Resource for Advancements in Linguistic Research: A Case Study

Abstract: This case study draws on experience in linguistic investigations using annotated monolingual and multilingual text corpora. The “cases” describe language phenomena belonging to different layers of the language system: morphology, surface and underlying syntax, and discourse. The analysis is based on the complex annotation of syntax, semantic functions, information structure, and discourse relations in the Prague Dependency Treebank, a collection of annotated Czech texts. We want to demonstrate that corpus annotation is not a self-contained goal: to be consistent, it should be based on a linguistic theory, and, at the same time, it should serve as a test bed for that theory in particular and for linguistic research in general.
