Paper information - BootCaT: Bootstrapping Corpora and Terms from the Web

BootCaT: Bootstrapping Corpora and Terms from the Web

This paper introduces the BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web. The procedure requires only a small set of seed terms as input. The seeds are used to build a corpus via automated Google queries, and more terms are extracted from this corpus. In turn, these new terms are used as seeds to build a larger corpus via automated queries, and so forth. The corpus and the unigram terms are then used to extract multi-word terms. We evaluated the tools by applying them to the construction of English and Italian corpora and term lists from the domain of psychiatry. The results illustrate the potential usefulness of the tools.
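The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the toolkit's Perl code: `search` is a hypothetical callable standing in for the automated web queries, and the tuple-based query building, the smoothed frequency-ratio scoring against `reference_counts`, and all function names are assumptions made for this example. The multi-word term extraction step is not covered.

```python
import random
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_queries(seeds, tuple_size=2, max_queries=10, rng=None):
    """Combine seeds into tuples; each tuple becomes one query string.

    Assumption: tuple size and query cap are illustrative defaults.
    """
    rng = rng or random.Random(0)
    combos = list(combinations(sorted(seeds), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_queries]]


def extract_terms(corpus_docs, reference_counts, top_n=5):
    """Rank corpus words by how much more frequent they are than in a
    general-language reference word count.

    Assumption: an add-one smoothed relative-frequency ratio is used
    here purely as a stand-in scoring function.
    """
    corpus_counts = Counter(tok for doc in corpus_docs for tok in tokenize(doc))
    corpus_total = sum(corpus_counts.values())
    ref_total = sum(reference_counts.values())

    def score(word):
        p_corpus = (corpus_counts[word] + 1) / (corpus_total + 1)
        p_ref = (reference_counts.get(word, 0) + 1) / (ref_total + 1)
        return p_corpus / p_ref

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(corpus_counts, key=lambda w: (-score(w), w))
    return ranked[:top_n]


def bootstrap(seeds, search, reference_counts, iterations=2, top_n=5):
    """Alternate between corpus building and term extraction.

    `search` is a hypothetical function: query string -> list of
    document texts. Extracted terms become the next round's seeds.
    """
    corpus, seen = [], set()
    terms = list(seeds)
    for _ in range(iterations):
        for query in build_queries(terms):
            for doc in search(query):
                if doc not in seen:  # skip documents already collected
                    seen.add(doc)
                    corpus.append(doc)
        terms = extract_terms(corpus, reference_counts, top_n)
    return corpus, terms
```

Passing the search step in as a function keeps the sketch independent of any particular search engine interface, so it can be exercised with a small in-memory document list.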

Silvia Bernardini | Marco Baroni
