hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene

Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified version of the standard "Web as Corpus" pipeline, designed with the limited amount of available web data in mind. The paper describes these modifications, focusing on content extraction from HTML pages, which combines high precision of the extracted language content with decent recall. The paper also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.

Tomaž Erjavec | Nikola Ljubešić
