Ubiquitous Usage of a French Large Corpus: Processing the Est Republicain Corpus

In this paper, we introduce a set of resources derived from the EST REPUBLICAIN CORPUS, a large, freely available collection of regional newspaper articles in French totaling 150 million words. These resources are the result of a full NLP treatment of the corpus: handling of multi-word expressions, lemmatization, part-of-speech tagging, and syntactic parsing. The corpus is processed with statistical machine-learning approaches (a joint model of data-driven lemmatization and part-of-speech tagging, and PCFG-LA and dependency-based models for parsing) that have been shown to achieve state-of-the-art performance when evaluated on the French Treebank. The derived resources are made freely available and are released under the original Creative Commons license of the EST REPUBLICAIN CORPUS. We additionally provide an overview of the use of these resources in various applications, in particular the use of word clusters generated from the corpus to alleviate lexical data sparseness in statistical parsing.
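As a rough illustration of the word-cluster idea mentioned above, the sketch below replaces each token with a cluster identifier, so that a parser trained on the result sees a much smaller vocabulary. It is not the authors' pipeline: the mapping `TOY_CLUSTERS`, the identifiers in it, the `C_UNK` fallback, and the function name `to_cluster_tokens` are all hypothetical and serve only to show the substitution step.

```python
from typing import Dict, List

# Hypothetical toy mapping from lowercased word forms to cluster identifiers.
# In practice such a mapping would be induced from a large unlabeled corpus.
TOY_CLUSTERS: Dict[str, str] = {
    "lundi": "C0110",
    "mardi": "C0110",
    "maire": "C1011",
    "préfet": "C1011",
}


def to_cluster_tokens(
    tokens: List[str],
    clusters: Dict[str, str],
    unknown: str = "C_UNK",
) -> List[str]:
    """Replace each token with its cluster id; unseen tokens map to `unknown`."""
    return [clusters.get(tok.lower(), unknown) for tok in tokens]


if __name__ == "__main__":
    sentence = ["Le", "maire", "arrive", "mardi"]
    print(to_cluster_tokens(sentence, TOY_CLUSTERS))
```

Because "lundi" and "mardi" share one identifier in this toy mapping, a statistic observed for one of them in training data also applies to the other, which is the sense in which clustering reduces lexical sparseness.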
