Specification of a General Linguistic Annotation Framework and its Use in a Real Context

In this paper we present AWA, a general architecture for representing the linguistic information produced by diverse linguistic processors. Our aim is to establish a coherent and flexible representation scheme that serves as the basis for exchanging information between those processors. We use TEI-P4 conformant feature structures as the representation schema for linguistic analyses. A consistent underlying data model, which captures the structure and relations contained in the information to be manipulated, has been identified and implemented as a set of classes following the object-oriented paradigm. To illustrate the usefulness of the model, we show the framework in use in a real context: two corpora have been annotated by means of an application whose aim is to exploit and manipulate the data created by the linguistic processors developed so far.
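To make the idea of an object-oriented data model over TEI-style feature structures more concrete, the following is a minimal sketch, not the AWA implementation. The class names (`Feature`, `FeatureStructure`), the feature names (`lemma`, `pos`), the `morphosyntactic` type, and the sample values are all hypothetical and chosen only for illustration. The element names `fs`, `f`, `sym`, and `str` follow the TEI feature-structure vocabulary as we understand it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One feature-value pair (hypothetical class, for illustration only)."""
    name: str
    value: str
    symbolic: bool = True  # True -> <sym value="..."/>, False -> <str>...</str>


@dataclass
class FeatureStructure:
    """A typed bundle of features, serializable as a TEI-style <fs> element."""
    fs_type: str
    features: list = field(default_factory=list)

    def add(self, name, value, symbolic=True):
        # Return self so that calls can be chained.
        self.features.append(Feature(name, value, symbolic))
        return self

    def to_xml(self):
        # Build <fs type="..."> with one <f name="..."> child per feature.
        fs = ET.Element("fs", {"type": self.fs_type})
        for feat in self.features:
            f = ET.SubElement(fs, "f", {"name": feat.name})
            if feat.symbolic:
                ET.SubElement(f, "sym", {"value": feat.value})
            else:
                ET.SubElement(f, "str").text = feat.value
        return fs


def serialize(fs):
    """Return the feature structure as an XML string."""
    return ET.tostring(fs.to_xml(), encoding="unicode")


# Assumed example: a morphosyntactic analysis of a single word form.
analysis = (FeatureStructure("morphosyntactic")
            .add("lemma", "etxe", symbolic=False)
            .add("pos", "NOUN"))
xml_text = serialize(analysis)
print(xml_text)
```

Keeping the in-memory classes separate from the XML serialization is one way a processor could manipulate analyses as objects while still exchanging them in a shared, TEI-conformant format.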
