Community Standards for Linguistically-Annotated Resources

This chapter provides a broad overview of the state of the art in standards development for language resources, beginning with a brief history to serve as context. It describes in some detail several major current efforts that define the standardization landscape for language resources today, with the aim of outlining their differences and commonalities and, more generally, identifying the progress made to date as well as the obstacles to definitive standardization. In addition to describing the standards most applicable to linguistic annotation of text, we include a section that surveys considerations and alternatives for spoken data. We also review a widely used and influential de facto standard and consider its role in standards development. Finally, we assess the standards landscape and the options available to current and future creators of linguistically-annotated resources.
