Design and development of Iberia: a corpus of scientific Spanish

Iberia is a synchronic corpus of scientific Spanish designed mainly for terminological studies. In this paper, we describe its design and the infrastructure for its acquisition, processing and exploitation, including mark-up, linguistic annotation, indexing and the user interface. Two pre-processing tasks affecting a large number of words are described in detail: de-hyphenation and identification of text fragments in other languages. We also show how some of the reported statistics, namely, dispersion and association, are used for research on lexis.

Emilio del Rosal García | Jordi Porta Zamorano | Ignacio Ahumada Lara

[1] Nancy Ide, et al. Corpus encoding standard: SGML guidelines for encoding linguistic corpora, 1998, LREC.

[2] W. B. Cavnar, et al. N-gram-based text categorization, 1994.

[3] Samuel Reese, et al. FreeLing 2.1: Five Years of Open-source Language Processing Tools, 2010, LREC.

[4] Carlo Zaniolo, et al. Efficient Structural Joins on Indexed XML Documents, 2002, VLDB.

[5] Lluís Padró, et al. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library, 2006, LREC.

[6] Ted Dunning, et al. Accurate Methods for the Statistics of Surprise and Coincidence, 1993, CL.

[7] Justin Zobel, et al. Inverted files for text search engines, 2006, CSUR.

[8] Andreas Nürnberger, et al. A Comparative Study on Language Identification Methods, 2008, LREC.