Our paper describes an ongoing experiment aimed at creating a family of billion-token web corpora that can, to a large extent, be called “comparable”: the corpora are of the same size; their data are gathered by crawling the web at (approximately) the same time; they contain similar web-specific domains, genres and registers; they are pre-processed, filtered and deduplicated by the same tools; they are morphologically annotated by (possibly) the same tagger; and they are made available via Sketch Engine. To overcome the great differences among the existing sketch grammars for the respective languages, a set of “compatible” sketch grammars has been written to aid contrastive linguistic research and bilingual lexicographic projects. The sketch grammars use a uniform set of rules for all word categories (parts of speech), and the resulting set of tables is displayed in a fixed order in all languages.