Linked Data in Linguistics

In this paper we describe a practical approach to the challenge of linguistic retrodigitization. We propose to distinguish strictly between a base digitization and a separate interpretation of the sources. The base digitization consists only of a literal electronic transcript of the source, so all sources are treated simply as strings of characters, i.e. as unstructured corpora. The often complex structure found in many dictionaries and grammars is added subsequently (and possibly much later) as Linked Data in the form of standoff annotation. An advantage of this approach is that both the digitization and the interpretation can be performed collaboratively without a complex organizational superstructure.
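The separation between base digitization and interpretation can be illustrated with a minimal sketch. It is not the implementation described in the paper: the dictionary entry, the `Standoff` class, the `http://example.org/` URIs and the `#char=start,end` way of addressing a character span are all assumptions made for illustration. The transcript stays an untouched string, and each interpretation is a separate record that points into it by character offsets and can be read as a subject-predicate-object triple.

```python
from dataclasses import dataclass

# Base digitization: a literal transcript kept as a plain string.
# The entry text is invented for this example.
TRANSCRIPT = "haus n. house, building"

# Hypothetical vocabulary namespace (not a real published vocabulary).
EX = "http://example.org/vocab#"


@dataclass(frozen=True)
class Standoff:
    """One interpretation of a character span; the transcript is never modified."""
    start: int  # inclusive character offset into the transcript
    end: int    # exclusive character offset
    prop: str   # URI of the property being asserted
    value: str  # asserted value

    def text(self, source: str) -> str:
        # Recover the annotated substring from the untouched transcript.
        return source[self.start:self.end]

    def to_triple(self, base_uri: str) -> tuple:
        # Address the span with an offset-based fragment (assumed scheme).
        subject = f"{base_uri}#char={self.start},{self.end}"
        return (subject, self.prop, self.value)


# Interpretation layer: can be added later, and by different contributors.
annotations = [
    Standoff(0, 4, EX + "role", "headword"),
    Standoff(5, 7, EX + "role", "partOfSpeech"),
    Standoff(8, 13, EX + "role", "gloss"),
]

if __name__ == "__main__":
    for a in annotations:
        print(a.to_triple("http://example.org/source/1"), a.text(TRANSCRIPT))
```

Because the annotations only reference offsets, further layers (or competing analyses of the same span) can be added without touching the transcript or each other.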
