Word-Based or Morpheme-Based? Annotation Strategies for Modern Hebrew Clitics

Morphologically rich languages pose a challenge to treebank annotators with respect to the status of orthographic (space-delimited) words in syntactic parse trees. In such languages an orthographic word may carry several distinct kinds of information, and the question arises whether we should represent such words as a sequence of their constituent morphemes (i.e., a Morpheme-Based annotation strategy) or preserve their special orthographic status within the trees (i.e., a Word-Based annotation strategy). In this paper we address this challenge empirically in the context of the development of Language Resources for Modern Hebrew. We compare and contrast the Morpheme-Based and Word-Based annotation strategies for pronominal clitics in Modern Hebrew, and we show that the Word-Based strategy is more adequate for training statistical parsers, as it provides a better PP-attachment disambiguation capacity and a better alignment with initial surface forms. Our findings in turn raise new questions concerning the interaction of morphological and syntactic processing, the investigation of which is facilitated by the parallel treebank we have made available.
