ANNIS: A Search Tool for Multi-Layer Annotated Corpora

ANNIS (see Dipper & Götze 2005; Chiarcos et al. 2008) is a flexible web-based corpus architecture for search and visualization of multi-layer linguistic corpora. By multi-layer we mean that the same primary datum may be annotated independently with (i) annotations of different types (spans, DAGs with labelled edges, and arbitrary pointing relations between terminals or non-terminals), and (ii) annotation structures that may overlap and/or conflict hierarchically. In this paper we present the features of the architecture as well as actual use cases for corpus-linguistic research in areas as diverse as information structure, learner language and discourse-level phenomena. The search functionalities supported by ANNIS2 include exact and regular-expression matching on word forms and annotations, as well as complex relations between individual elements, such as all forms of overlapping, contained or adjacent annotation spans, hierarchical dominance (children, ancestors, left- or rightmost child etc.) and more. As an alternative to the query language, data can be accessed using a graphical query builder. Query matches are visualized depending on annotation type: annotations referring to tokens (e.g. lemma, POS, morphology) are shown immediately in the match list. Spans (covering one or more tokens) are displayed in a grid view, trees/graphs in a tree/graph view, and pointing relations (such as anaphoric links) in a discourse view, with same-colour highlighting for coreferent elements. Full Unicode support is provided, and a media player is embedded for rendering audio files linked to the data, allowing for a large variety of corpora. Corpus data is annotated with automatic tools (taggers, parsers etc.) or task-specific expert tools for manual annotation, and then mapped onto the interchange format PAULA (Dipper 2005), where stand-off annotations refer to the same primary data.
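
The span relations named above (overlap, containment, adjacency) can be illustrated with a minimal Python sketch. It assumes that an annotation span is represented by inclusive token indices; the `Span` class and the three helper functions are hypothetical names chosen for illustration and are not part of ANNIS or its query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation span over token indices (inclusive on both ends)."""
    start: int
    end: int
    label: str = ""


def overlaps(a: Span, b: Span) -> bool:
    # The two spans share at least one token.
    return a.start <= b.end and b.start <= a.end


def contains(a: Span, b: Span) -> bool:
    # Every token covered by b is also covered by a.
    return a.start <= b.start and b.end <= a.end


def adjacent(a: Span, b: Span) -> bool:
    # a ends on the token directly before the first token of b.
    return a.end + 1 == b.start


# Illustrative spans from two independent annotation layers over the same tokens.
topic = Span(0, 2, "topic")
noun_phrase = Span(1, 2, "NP")
focus = Span(3, 5, "focus")

print(overlaps(topic, noun_phrase))
print(contains(topic, noun_phrase))
print(adjacent(topic, focus))
```

Because each layer refers only to token positions in the shared primary data, spans from independent layers can be compared without either layer knowing about the other.
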
Importers exist for many formats, including EXMARaLDA (Schmidt 2004), TigerXML (Brants & Plaehn 2000), MMAX2 (Müller & Strube 2006), RSTTool (O’Donnell 2000), PALinkA (Orasan 2003) and Toolbox (Stuart et al. 2007). Data is compiled into a relational database for optimal performance. Query matches and their features can also be exported in the ARFF format and processed with the data mining tool WEKA (Witten & Frank 2005), which offers implementations of clustering and classification algorithms. ANNIS2 compares favourably with the search functionalities of the above tools as well as with other corpus search engines (EXAKT, http://www.exmaralda.org/exakt.html; TIGERSearch, Lezius 2002; CWB, Christ 1994) and other frameworks/architectures (NITE, Carletta et al. 2003; GATE, Cunningham 2002).
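
To give a rough idea of what an ARFF export of query matches might look like, the following Python sketch serializes a few match records into ARFF text (a header of `@relation` and `@attribute` declarations followed by a `@data` section). This is not the ANNIS2 exporter: the function names `quote` and `to_arff`, the relation name and the feature names are assumptions made for illustration only.

```python
def quote(value):
    """Quote a value for ARFF if it contains characters that need protection."""
    text = str(value)
    if any(ch in text for ch in " ,{}%'\""):
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return text


def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) pairs, where kind is "numeric", "string",
                or a list of allowed values for a nominal attribute.
    rows:       list of dicts mapping attribute names to values.
    """
    lines = ["@relation " + quote(relation), ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            declaration = "{" + ",".join(quote(v) for v in kind) + "}"
        else:
            declaration = kind
        lines.append("@attribute " + quote(name) + " " + declaration)
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(quote(row[name]) for name, _ in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical match features; real exports depend on the corpus annotations.
attributes = [
    ("word", "string"),
    ("pos", ["NN", "VVFIN"]),
    ("span_length", "numeric"),
]
rows = [
    {"word": "Haus", "pos": "NN", "span_length": 1},
    {"word": "geht", "pos": "VVFIN", "span_length": 1},
]

arff_text = to_arff("annis_matches", attributes, rows)
print(arff_text)
```

A file written this way can be loaded into WEKA, where the nominal `pos` attribute could serve, for example, as a class label for a classifier.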

[1]  Laurent Romary,et al.  International standard for a linguistic annotation framework , 2003, HLT-NAACL 2003.

[2]  Ulf Leser,et al.  Storing and Querying Historical Texts in a Relational Database , 2005 .

[3]  Christian Chiarcos,et al.  Ontology-Based XQuery’ing of XML-Encoded Language Resources on Multiple Annotation Layers , 2008, LREC.

[4]  Anke Lüdeling,et al.  What’s Hard? Quantitative Evidence for Difficult Constructions in German Learner Data , 2008 .

[5]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[6]  Nancy Ide,et al.  GrAF: A Graph-based Format for Linguistic Annotations , 2007, LAW@ACL.

[7]  Seanna Doolittle,et al.  Das Lernerkorpus Falko , 2008 .

[8]  Stefanie Dipper,et al.  XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation , 2005, Berliner XML Tage.

[9]  Christian Chiarcos,et al.  An OWL-and XQuery-based mechanism for the retrieval of linguistic patterns from XML-corpora , 2007 .

[10]  S. Lukas Challenges in Modelling a Richly Annotated Diachronic Corpus of German , 2004 .

[11]  Christa Dürscheid Syntax: Grundlagen und Theorien , 2010 .

[12]  Sabine Schulte im Walde Experiments on the Automatic Induction of German Semantic Verb Classes , 2006, CL.

[13]  Jef L. Teugels,et al.  Challenges in modelling stochasticity in wind , 2002 .

[14]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[15]  Katharina Hartmann,et al.  Information structure in African languages: corpora and tools , 2009, Lang. Resour. Evaluation.

[16]  Oliver Christ,et al.  A Modular and Flexible Architecture for an Integrated Corpus Query System , 1994, ArXiv.

[17]  Marga Reis,et al.  Linguistic evidence : empirical, theoretical, and computational perspectives , 2005 .

[18]  Petr Pajas,et al.  System for Querying Syntactically Annotated Corpora , 2009, ACL/IJCNLP.

[19]  Patrick Grommes,et al.  Mehrdeutigkeiten und Kategorisierung: Probleme bei der Annotation von Lernerkorpora , 2008 .

[20]  Seanna Doolittle,et al.  Entwicklung und Evaluierung eines auf dem Stellungsfeldermodell basierenden syntaktischen Annotationsverfahrens für Lernerkorpora innerhalb einer Mehrebenen-Architektur mit Schwerpunkt auf schriftlichen Texten fortgeschrittener Deutschlerner , 2008 .

[21]  Niels Ole Bernsen,et al.  THE NITE WORKBENCH. A Tool for Annotation of Natural Interactivity and Multimodal Data , 2002, LREC.

[22]  Douglas Biber,et al.  Representativeness in corpus design , 1993 .

[23]  Jean Carletta,et al.  The NITE XML Toolkit: Data Model and Query Language , 2005, Lang. Resour. Evaluation.

[24]  Jean Véronis,et al.  Parallel Text Processing , 2000 .

[25]  Ian Witten,et al.  Data Mining , 2000 .

[26]  Laurent Romary,et al.  Parallel alignment of structured documents , 2000 .

[27]  Sylviane Granger,et al.  Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching , 2002 .

[28]  Renata Vieira,et al.  From manual to automatic annotation of coreference , 2003 .

[29]  Sabine Brants,et al.  The TIGER Treebank , 2001 .

[30]  Torsten Grust,et al.  Accelerating XPath evaluation in any RDBMS , 2004, TODS.

[31]  Steven Bird,et al.  Managing Fieldwork Data with Toolbox and the Natural Language Toolkit , 2007 .

[32]  Geert-Jan M. Kruijff,et al.  Discourse-level Annotation for Investigating Information Structure , 2004, ACL 2004.

[33]  Christian Chiarcos,et al.  A Flexible Framework for Integrating Annotations from Different Tools and Tagsets , 2008 .

[34]  Sylviane Granger,et al.  The International Corpus of Learner English , 1993 .

[35]  Piotr Banski,et al.  Stand-off TEI Annotation: the Case of the National Corpus of Polish , 2009, Linguistic Annotation Workshop.

[36]  Mitchell P. Marcus,et al.  OntoNotes: The 90% Solution , 2006, NAACL.

[37]  Geoffrey Leech,et al.  Introducing corpus annotation , 1997 .

[38]  Christoph Müller,et al.  Multi-level annotation of linguistic data with MMAX 2 , 2006 .

[39]  David A. Ferrucci,et al.  UIMA: an architectural approach to unstructured information processing in the corporate research environment , 2004, Natural Language Engineering.

[40]  Thorsten Brants,et al.  Interactive Corpus Annotation , 2000, LREC.

[41]  Petr Pajas,et al.  Recent Advances in a Feature-Rich Framework for Treebank Annotation , 2008, COLING.

[42]  Andreas Witt,et al.  A Web-Platform for Preserving, Exploring, Visualising, and Querying Linguistic Corpora and other Resources , 2008, Proces. del Leng. Natural.

[43]  Sabine Buchholz,et al.  CoNLL-X Shared Task on Multilingual Dependency Parsing , 2006, CoNLL.

[44]  Ann Bies,et al.  Bracketing Guidelines For Treebank II Style Penn Treebank Project , 1995 .

[45]  Christian Chiarcos,et al.  Building and Using a Richly Annotated Interlinear Diachronic Corpus: The Case of Old High German Tatian , 2009, TAL.

[46]  Stefanie Dipper,et al.  Accessing Heterogeneous Linguistic Data — Generic XML-based Representation and Flexible Visualization , 2004 .

[47]  Rachel Panckhurst,et al.  Traitement automatique des langues. , 2001 .

[48]  Hamish Cunningham,et al.  GATE-a General Architecture for Text Engineering , 1996, COLING.

[49]  Karin Harbusch,et al.  The relationship between grammaticality ratings and corpus frequencies: A case study into word order variability in the midfield of German clauses , 2005 .

[50]  Manfred Stede,et al.  SUMMaR: Combining Linguistics and Statistics for Text Summarization , 2006, ECAI.

[51]  Constantin Orasan,et al.  PALinkA: A highly customisable tool for discourse annotation , 2003, SIGDIAL Workshop.

[52]  Thomas C. Schmidt Transcribing and annotating spoken language with EXMARaLDA , 2004 .

[53]  Michael O’Donnell,et al.  RSTTool 2.4 - A markup Tool for Rhetorical Structure Theory , 2000, INLG.

[54]  Hamish Cunningham GATE, a General Architecture for Text Engineering , 2002 .

[55]  Stefan Evert,et al.  The NITE XML Toolkit: Flexible annotation for multimodal language data , 2003, Behavior research methods, instruments, & computers : a journal of the Psychonomic Society, Inc.

[56]  Manfred Krifka,et al.  Basic notions of information structure , 2008 .

[57]  Nils Diewald,et al.  Serengeti - Webbasierte Annotation semantischer Relationen , 2008, J. Lang. Technol. Comput. Linguistics.