A Manual for Web Corpus Crawling of Low Resource Languages

Since the seminal publication of “Web as Corpus” [1], the potential of the web as a source of corpora has been firmly established, for online and offline corpora alike: noisy vs. clean, balanced vs. convenient, annotated vs. raw, small vs. big are only some of the oppositions that describe the range of corpora that can be, and have been, created. In the wake of the project Under Resourced Language Content Finder (URLCoFi), we describe a systematic approach to compiling corpora for low-resource (or under-resourced) languages (LRLs) from the web, in connection with a free eLearning course funded by studiumdigitale at Goethe University, Frankfurt. Although documents are easy to retrieve from the web, some characteristics of the digital medium introduce difficulties. For instance, if someone were to collect all documents on the web in a certain language, the collection could, first, only be a snapshot, since web content changes constantly, and second, there would be no way to ascertain its completeness. In this paper, we show ways to deal with such difficulties in search scenarios for LRLs, drawing on experiences from a course on this topic.

[1] A. Kilgarriff and G. Grefenstette, “Web as corpus,” in Proceedings of Corpus Linguistics 2001, 2001, pp. 342–344.