NP Subject Detection in Verb-initial Arabic Clauses

Phrase re-ordering is a well-known obstacle to robust machine translation for language pairs with significantly different word orders. For Arabic-English, a pair in which the order of subject and verb usually differs, the subject and its modifiers must be moved accurately to produce a grammatical translation. This operation requires more than base phrase chunking and often defeats current phrase-based statistical decoders. We present a conditional random field sequence classifier that detects the full scope of Arabic noun phrase subjects in verb-initial clauses with an F-measure (β = 1) of 61.3%, a 5.0% absolute improvement over a statistical parser baseline. We suggest methods for integrating the classifier output with a statistical decoder and present preliminary machine translation results.
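
To make the reported F-measure concrete, the following is a minimal sketch of how a span-level score with β = 1 could be computed from per-token labels. The abstract does not describe the tag scheme or the matching criterion, so the B-SUBJ / I-SUBJ / O labels, the exact-match scoring, and the function names `spans_from_tags` and `f_beta` are all assumptions made for illustration; this is not the authors' evaluation code.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical representation: a subject span is (start, end), end exclusive.
Span = Tuple[int, int]


def spans_from_tags(tags: List[str]) -> Set[Span]:
    """Collect maximal runs of subject-labelled tokens as (start, end) spans.

    Assumes a B-SUBJ / I-SUBJ / O tag scheme (an illustrative choice,
    not taken from the abstract).
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B-SUBJ":
            if start is not None:      # close a span that was still open
                spans.add((start, i))
            start = i
        elif tag == "I-SUBJ":
            if start is None:          # tolerate an I- tag without a B- tag
                start = i
        else:
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:              # span running to the end of the clause
        spans.add((start, len(tags)))
    return spans


def f_beta(gold: Set[Span], pred: Set[Span], beta: float = 1.0) -> float:
    """F-measure over exactly matching spans (assumed matching criterion)."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    gold_tags = ["O", "B-SUBJ", "I-SUBJ", "I-SUBJ", "O"]
    pred_tags = ["O", "B-SUBJ", "I-SUBJ", "O", "O"]
    # Under exact matching, a subject whose modifiers are cut off earns no credit.
    print(f_beta(spans_from_tags(gold_tags), spans_from_tags(pred_tags)))
```

Under this exact-match assumption, a prediction that finds the head noun but truncates its modifiers counts as an error, which reflects the abstract's emphasis on detecting the full scope of the subject.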
