Statistical Parsing of Spanish and Data Driven Lemmatization

Although parsing performance has greatly improved in recent years, grammar inference from treebanks for morphologically rich languages, especially from small treebanks, remains a challenging task. In this paper we investigate how state-of-the-art parsing performance can be achieved on Spanish, a language with rich verbal morphology, using a non-lexicalized parser trained on a treebank containing only around 2,800 trees. We rely on accurate part-of-speech tagging and data-driven lemmatization to cope with lexical data sparseness. While it provides state-of-the-art results on Spanish, our methodology is applicable to other languages.
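To illustrate why lemmatization helps with lexical data sparseness, the following minimal sketch maps inflected forms of a verb to a single lemma and compares vocabulary sizes before and after. The lookup table `TOY_LEMMAS`, the tag names, and the helper functions are hypothetical and serve only as an example; the approach described above relies on a trained tagger and a data-driven lemmatizer, not on a hand-written dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lemma table keyed by (lowercased form, POS tag).
# A data-driven lemmatizer would predict these lemmas with a trained model.
TOY_LEMMAS: Dict[Tuple[str, str], str] = {
    ("canto", "VERB"): "cantar",
    ("cantas", "VERB"): "cantar",
    ("cantamos", "VERB"): "cantar",
    ("canciones", "NOUN"): "canción",
}


def lemmatize(tagged: List[Tuple[str, str]]) -> List[str]:
    """Replace each (form, tag) pair by its lemma; fall back to the lowercased form."""
    return [TOY_LEMMAS.get((form.lower(), tag), form.lower()) for form, tag in tagged]


def vocabulary_size(tokens: List[str]) -> int:
    """Number of distinct token types."""
    return len(set(tokens))


# Tiny POS-tagged sample (assumed input format: list of (form, tag) pairs).
tagged_corpus = [
    ("Canto", "VERB"), ("canciones", "NOUN"),
    ("cantas", "VERB"), ("canciones", "NOUN"),
    ("cantamos", "VERB"), ("canciones", "NOUN"),
]

forms = [form.lower() for form, _ in tagged_corpus]
lemmas = lemmatize(tagged_corpus)

print(vocabulary_size(forms), vocabulary_size(lemmas))  # prints "4 2"
```

In this toy sample three verb forms collapse into one lemma, so a grammar estimated over lemmas sees each lexical item more often than one estimated over raw word forms, which matters most when the treebank is small.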
