Avoiding Data Graveyards: From Heterogeneous Data Collected in Multiple Research Projects to Sustainable Linguistic Resources

E-MELD 2006 Workshop on Digital Language Documentation: Tools and Standards - The State of the Art

This paper describes a new research initiative addressing the sustainability of linguistic resources. The initiative is a cooperation between three collaborative research centres in Germany: the SFB 441 “Linguistic Data Structures” in Tübingen, the SFB 538 “Multilingualism” in Hamburg, and the SFB 632 “Information Structure” in Potsdam/Berlin. The aim of the project is to develop methods for the sustainable archiving of the diverse bodies of linguistic data used at the three sites. The first half of the paper briefly introduces the data handling solutions developed so far at the three centres, then assesses their commonalities and differences and what these entail for the work of the new joint initiative. The second half sketches seven areas of open questions concerning sustainable data handling and gives more detailed accounts of two of them: the integration of linguistic terminologies and the development of best practice guidelines.
