Paper information - A Flexible Infrastructure for Large Monolingual Corpora

In this paper we describe a flexible and portable infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed using a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and use of sentence-based word collocations is discussed in detail. Finally, we give an overview of different applications for this language resource. A WWW interface allows public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de).
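
The abstract names sentence-based segmentation and sentence-based word collocations but gives no details of either. The following is only a minimal sketch of the general idea, not the authors' method: it assumes a naive punctuation-based sentence splitter and plain co-occurrence counts with no significance measure. The function names `split_sentences` and `sentence_cooccurrences` are hypothetical and chosen for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_cooccurrences(text):
    """Count unordered word pairs that occur together within one sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        # Lowercase and deduplicate so each pair counts once per sentence.
        words = sorted(set(re.findall(r"\w+", sentence.lower())))
        counts.update(combinations(words, 2))
    return counts


sample = "The corpus grows daily. The corpus is public. Tools query the corpus."
pairs = sentence_cooccurrences(sample)
print(pairs.most_common(3))
```

Raw counts like these favor frequent words; a real collocation tool would normally rank pairs with a statistical significance measure, which is left out here.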

Christian Wolff | Uwe Quasthoff
