Paper information: Corpus representativeness for syntactic information acquisition

Corpus representativeness for syntactic information acquisition

This paper reports on part of our research into the automatic acquisition of computational lexicon information from corpora, specifically our ongoing work on corpus representativeness. For the task of inducing information from text, we wanted to establish a degree of confidence in the size and composition of the document collection to be observed. The results show that it is possible to work with a relatively small corpus of texts if it is tuned to a particular domain. Moreover, a small tuned corpus appears to be more informative for real parsing than a general corpus.

Núria Bel
