Paper information - LDC Arabic Treebanks and Associated Corpora: Data Divisions Manual

The Linguistic Data Consortium (LDC) has developed hundreds of corpora for natural language processing (NLP) research, among them a number of annotated treebank corpora for Arabic. Typically, each of these corpora consists of a single collection of annotated documents. NLP research, however, usually requires separate data sets for training models, developing techniques, and final evaluation, so each corpus must be divided into the required data sets (divisions). This document details a set of rules defined to enable consistent divisions of existing and new Arabic treebanks (ATB) and related corpora.
