Better Arabic Parsing: Baselines, Evaluations, and Analysis

In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design. First, we identify sources of syntactic ambiguity that are understudied in the existing parsing literature. Second, we show that although the Penn Arabic Treebank is similar to other treebanks in gross statistical terms, annotation consistency remains problematic. Third, we develop a human-interpretable grammar that is competitive with a latent-variable PCFG. Fourth, we show how to build better models for three different parsers. Finally, we show that in application settings, the absence of gold segmentation lowers parsing performance by 2–5% F1.
