Lemmatization and Lexicalized Statistical Parsing of Morphologically-Rich Languages: the Case of French

This paper shows that training a lexicalized parser on a lemmatized treebank of a morphologically rich language, such as the French Treebank, slightly improves parsing results. We also show that lemmatizing a similarly sized subset of the English Penn Treebank has almost no effect on parsing performance with gold lemmas and leads to a small drop in performance when automatically assigned lemmas and POS tags are used. This highlights two facts: (i) lemmatization helps to reduce lexicon data-sparseness issues for French, and (ii) it also makes the parsing process sensitive to the correct assignment of POS tags to unknown words.
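
The data-sparseness point can be illustrated with a toy sketch. It is not the paper's experimental setup: the lemma table, the sentences, and the helper names (`TOY_LEMMAS`, `lemmatize`, `type_token_stats`) are all hypothetical, chosen only to show how mapping inflected forms to lemmas shrinks the lexicon more for a richly inflected language than for English.

```python
from collections import Counter

# Hypothetical toy lemma table; a real system would use a trained
# lemmatizer or a morphological lexicon instead.
TOY_LEMMAS = {
    "mange": "manger", "manges": "manger",
    "mangeons": "manger", "mangent": "manger",
    "eats": "eat", "eating": "eat",
}

def lemmatize(tokens, table=TOY_LEMMAS):
    """Replace each token by its lemma, leaving unknown tokens unchanged."""
    return [table.get(tok, tok) for tok in tokens]

def type_token_stats(tokens):
    """Return (number of distinct word types, number of types seen once)."""
    counts = Counter(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons

# Made-up example sentences: the French verb surfaces in four forms,
# the English verb in one.
french = "je mange tu manges nous mangeons ils mangent".split()
english = "i eat you eat we eat they eat".split()

for name, toks in [("French", french), ("English", english)]:
    print(name, "raw:", type_token_stats(toks),
          "lemmatized:", type_token_stats(lemmatize(toks)))
```

In this toy input, the four French verb forms collapse into one lemma, so the lexicon has fewer types and fewer rare entries, while the English counts do not change. Real treebanks will show a less clean contrast.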
