Paper information - Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages

The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in automatically collecting and processing text data for a large number of languages. Our main interest lies in low-density languages, for which only little text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a large number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. The paper focuses on the infrastructure for the automatic acquisition of large amounts of monolingual text in many languages from various sources. Preliminary results of the text collection are presented. The largely language-independent framework for preprocessing, cleaning and creating the corpora and for computing the necessary statistics is also described.
