A Web-Platform for Preserving, Exploring, Visualising, and Querying Linguistic Corpora and other Resources

We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at researchers in Linguistics and Computational Linguistics: a comprehensive database of metadata records can be explored to find language resources appropriate for one's specific research needs. SPLICR also provides a graphical interface that enables users to query and visualise corpora. The project in which the system is being developed aims to sustainably archive the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) to process and sustainably archive the resources so that they are still available to the research community in five, ten, or even 20 years' time; (b) to enable researchers to query the resources both on the level of their metadata and on the level of linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.
