arTenTen: a new, vast corpus for Arabic

We present arTenTen, a web-crawled corpus of Arabic gathered in 2012 and a member of the TenTen Corpus Family (Jakubíček et al. 2013). arTenTen comprises 5.8 billion words. It has been carefully cleaned, including duplicate removal, using the JusText and Onion tools (Pomikálek 2011). We are currently (May 2013) tokenising, lemmatising and part-of-speech tagging arTenTen with the leading MADA tool, version 3.2 (Habash and Rambow 2005; Habash et al. 2009). Once arTenTen is fully encoded, we will compare it with Arabic Gigaword and an earlier web-crawled corpus (Sharoff 2006). We also plan to explore arTenTen’s composition in relation to Modern Standard Arabic and the dialects, using, amongst other things, Buckwalter and Parkinson’s Frequency Dictionary (2011) and the keywords method presented in Kilgarriff (2012).

[1] Nizar Habash, et al. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop, 2005, ACL.

[2] Serge Sharoff. Creating General-Purpose Corpora Using Automated Search Engine Queries, 2006.

[3] Nizar Habash, et al. MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization, 2009.

[4] Vít Suchomel, et al. Efficient Web Crawling for Large Text Corpora, 2012.

[5] Adam Kilgarriff, et al. Getting to Know Your Corpus, 2012, TSD.

[6] Adam Kilgarriff, et al. The TenTen Corpus Family, 2013.

[7] Tim Buckwalter, et al. A Frequency Dictionary of Arabic: Core Vocabulary for Learners, 2010.

[8] Jan Pomikálek. Removing Boilerplate and Duplicate Content from Web Corpora, 2011.

[9] Nizar Habash, et al. On Arabic Transliteration, 2007.