SPLICR: A Sustainability Platform for Linguistic Corpora and Resources

We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at researchers in Linguistics and Computational Linguistics: a comprehensive database of metadata records can be explored to find language resources appropriate for one’s specific research needs. SPLICR also provides an interface that enables users to query and visualise corpora. The project in which the system is being developed aims to sustainably archive the approximately 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) to process and sustainably archive the resources so that they are still available to the research community in five, ten, or even 20 years’ time; and (b) to enable researchers to query the resources both at the level of their metadata and at the level of their linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.
