Corpus and exploitation tool: IULACT and bwanaNet

Over the last few decades, corpus linguistics methods have been used increasingly in almost all linguistics-related studies, mainly because they are useful for obtaining and validating results. IULACT is a project of the Applied Linguistics Institute that aims to compile a corpus of language for specific purposes (LSP) texts. The corpus includes documents from a variety of domains, registers, and languages. In contrast to other resources captured directly from the web, the IULACT texts have been selected under supervision, converted to a clean SGML format, and processed with a set of specific tools. The overall goal of the project is to provide not only an infrastructure for studying LSP but also a resource for computational linguistics research. This article describes the IULACT corpus, its architecture, and its processing tools. It starts from the design criteria and then analyzes each of the necessary processing stages and the related software tools. Finally, it describes bwanaNet, the corpus browser tool.
