Ubiquitous Usage of a Broad Coverage French Corpus: Processing the Est Republicain corpus

In this paper, we introduce a set of resources that we have derived from the EST REPUBLICAIN CORPUS, a large, freely available collection of regional newspaper articles in French, totaling 150 million words. Our resources are the result of a full NLP treatment of the EST REPUBLICAIN CORPUS: handling of multi-word expressions, lemmatization, part-of-speech tagging, and syntactic parsing. The corpus is processed using statistical machine-learning approaches (a joint model of data-driven lemmatization and part-of-speech tagging, and PCFG-LA and dependency-based models for parsing) that have been shown to achieve state-of-the-art performance when evaluated on the French Treebank. Our derived resources are made freely available and released under the original Creative Commons license of the EST REPUBLICAIN CORPUS. We additionally provide an overview of the use of these resources in various applications, in particular the use of word clusters generated from the corpus to alleviate lexical data sparseness in statistical parsing.
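As a rough illustration of how word clusters can reduce lexical data sparseness, the sketch below replaces each token with a cluster identifier before it would be passed to a parser, so that rare words share statistics with frequent words in the same cluster. The function name, the cluster table, the cluster identifiers, and the `UNK` fallback are all hypothetical and chosen for illustration; the abstract does not specify how the clusters are produced or applied.

```python
from typing import Dict, List

# Hypothetical toy cluster table: in practice such a mapping would be
# induced from a large unannotated corpus; these entries are invented.
TOY_CLUSTERS: Dict[str, str] = {
    "lundi": "C0110",
    "mardi": "C0110",
    "nancy": "C1011",
    "metz": "C1011",
}


def replace_with_clusters(
    tokens: List[str],
    word_to_cluster: Dict[str, str],
    unknown: str = "UNK",
) -> List[str]:
    """Map each token to its cluster id (case-insensitive lookup).

    Tokens absent from the table fall back to `unknown`, an assumed
    convention for out-of-vocabulary words.
    """
    return [word_to_cluster.get(tok.lower(), unknown) for tok in tokens]


if __name__ == "__main__":
    print(replace_with_clusters(["Mardi", "Metz", "xyz"], TOY_CLUSTERS))
```

Because "lundi" and "mardi" map to the same identifier in this toy table, a model trained on cluster identifiers would treat them alike even if one of them were rare in the training treebank.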
