arTenTen: a new, vast corpus for Arabic
[1] Nizar Habash, et al. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop, 2005, ACL.
[2] Serge Sharoff. Creating General-Purpose Corpora Using Automated Search Engine Queries, 2006.
[3] Nizar Habash, et al. MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization, 2009.
[4] Vít Suchomel, et al. Efficient Web Crawling for Large Text Corpora, 2012.
[5] Adam Kilgarriff, et al. Getting to Know Your Corpus, 2012, TSD.
[6] Adam Kilgarriff, et al. The TenTen Corpus Family, 2013.
[7] Tim Buckwalter, et al. A Frequency Dictionary of Arabic: Core Vocabulary for Learners, 2010.
[8] Jan Pomikálek. Removing Boilerplate and Duplicate Content from Web Corpora, 2011.
[9] Nizar Habash, et al. On Arabic Transliteration, 2007.