Syntax and Parsing of Semitic Languages

The grammar of Semitic languages differs from that of English and many other languages. As a result, general-purpose statistical parsers are not always as successful when applied to Semitic data. This chapter presents the syntax of Semitic languages and discusses how it challenges existing general-purpose parsing architectures. We then survey the components of a generative probabilistic parsing system and show how they can be designed and implemented to cope effectively with these challenges. Finally, we present parsing results obtained for Hebrew and Arabic using different technologies in different scenarios. While parsing Semitic languages can already be made quite accurate using present techniques, the remaining challenges leave ample room for future research.
