Enriching a French Treebank

This paper presents the current status of the French treebank developed at Paris 7 (Abeille et al., 2003a). The corpus comprises 1 million words from the newspaper Le Monde, fully annotated and disambiguated for parts of speech, inflectional morphology, compounds and lemmas, and syntactic constituents. It is representative of contemporary normalized written French and covers a variety of authors and subjects (economy, literature, politics, etc.), with extracts from newspaper issues ranging from 1989 to 1993. It has been used by computational linguists to train and evaluate taggers, parsers and lemmatizers, as well as by psycholinguists to extract lexical and syntactic preferences (Pynte et al., 2001). It is now being enriched with functional information and used for parsing evaluation.

1. The French treebank

As in the Penn Treebank, we have annotated both parts of speech and constituents. Unlike the Penn Treebank, we have also annotated compounds, lemmas and inflectional morphology. Our annotation choices are meant to be linguistically motivated and compatible with various linguistic theories. We have chosen surface-based annotations, with no empty categories (Abeille and Clement, 2002; Abeille et al., 2003b; Abeille, 2003).

With compounds amalgamated and punctuation marks not counted, the treebank comprises 870 000 tokens and 37 000 different lemmas, making up about 32 000 independent sentences. The average number of words per sentence is 27 and the average number of phrases is 20 (some phrases are unary).

The corpus was automatically tagged and then hand-corrected by human annotators in a first phase, and automatically chunked and then hand-corrected in a second phase (Clement, 2001; Toussenel, 2001; Abeille et al., 2003a).
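As a quick consistency check, the token and sentence counts quoted in this section can be related to the reported average sentence length. The short Python sketch below uses only the rounded figures given in the text; the variable names are illustrative, and the sentence count is the approximate value ("about 32 000") stated above.

```python
# Rounded corpus figures as quoted in the text (compounds amalgamated,
# punctuation marks not counted).
tokens = 870_000
lemmas = 37_000
sentences = 32_000  # approximate ("about 32 000")

# Average sentence length implied by the counts.
avg_words_per_sentence = tokens / sentences

# Average number of tokens per distinct lemma.
tokens_per_lemma = tokens / lemmas

print(round(avg_words_per_sentence, 1))  # 27.2
print(round(tokens_per_lemma, 1))        # 23.5
```

The implied average of roughly 27 words per sentence agrees with the figure reported in the text.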
In the first phase, the task of the annotators was to validate the sentence boundaries and the compounds (adding missing compounds and rejecting possible compounds that were irrelevant in a given context), and to validate the morpho-syntactic tags, especially for notoriously difficult cases (for example as a preposition or as a determiner). In the second phase, the annotators' task was to validate the constituent labels and boundaries, adding embedding where appropriate, and to signal remaining errors that could have been overlooked in the first phase. They used a specific Emacs-based annotation tool. The annotated and validated corpus is formatted in XML, using the XCES recommendations, and is available for research purposes.

We distinguish 14 lexical categories, used for simple words as well as for compounds: A (adjective), Adv (adverb), CC (coordinating conjunction), CL (weak clitic pronoun), CS (subordinating conjunction), D (determiner), ET (foreign word), I (interjection), NC (common noun), NP (proper name), P (preposition), PRO (strong pronoun), V (verb), PONCT (punctuation mark).

We distinguish 12 phrasal categories: AP (adjectival phrase), AdP (adverbial phrase), COORD (coordinated phrase), NP (noun phrase), PP (prepositional phrase), VN (verbal nucleus), VPinf (infinitival clause), VPpart (participial clause), SENT (independent clause), Sint (parenthetical), Srel (relative clause), Ssub (other subordinated clause).

We chose to annotate only major phrases, with little internal structure (for example, determiners and modifying adjectives are at the same level in the noun phrase). For the sake of simplicity, we make parsimonious use of unary phrases. For rigid sequences of categories, such as dates or titles, where it is difficult to determine the head, we use one global NP with no internal constituents.
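The two category inventories can be written down as plain data, which makes it easy to check that every label in a tree belongs to the scheme. The following is a minimal Python sketch: the names LEXICAL_CATEGORIES, PHRASAL_CATEGORIES and unknown_labels, as well as the nested-tuple tree encoding, are illustrative assumptions and do not reflect the treebank's distributed XML format.

```python
# Category inventories as listed in the text.
LEXICAL_CATEGORIES = {
    "A": "adjective", "Adv": "adverb", "CC": "coordinating conjunction",
    "CL": "weak clitic pronoun", "CS": "subordinating conjunction",
    "D": "determiner", "ET": "foreign word", "I": "interjection",
    "NC": "common noun", "NP": "proper name", "P": "preposition",
    "PRO": "strong pronoun", "V": "verb", "PONCT": "punctuation mark",
}
PHRASAL_CATEGORIES = {
    "AP": "adjectival phrase", "AdP": "adverbial phrase",
    "COORD": "coordinated phrase", "NP": "noun phrase",
    "PP": "prepositional phrase", "VN": "verbal nucleus",
    "VPinf": "infinitival clause", "VPpart": "participial clause",
    "SENT": "independent clause", "Sint": "parenthetical",
    "Srel": "relative clause", "Ssub": "other subordinated clause",
}


def unknown_labels(tree):
    """Return the labels in `tree` that are outside the annotation scheme.

    Hypothetical encoding: a token is a (tag, word) pair with a string as
    second element; a phrase is a (label, children) pair with a list as
    second element. "NP" is valid at both levels (proper name vs. noun
    phrase), so the node type decides which inventory applies.
    """
    label, content = tree
    if isinstance(content, str):  # token node
        return [] if label in LEXICAL_CATEGORIES else [label]
    found = [] if label in PHRASAL_CATEGORIES else [label]
    for child in content:
        found.extend(unknown_labels(child))
    return found


# Flat noun phrase: the determiner and the modifying adjective are
# sisters of the head noun, with no intermediate constituent.
flat_np = ("NP", [("D", "le"), ("A", "petit"), ("NC", "livre")])
print(unknown_labels(flat_np))
```

Keeping the two inventories separate matters because the label NP is shared between the lexical level (proper name) and the phrasal level (noun phrase).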
For coordinations, we have a COORD phrase (for the conjunction and the non-initial conjuncts), usually included inside a major phrase (headed by the initial conjunct). We do not have discontinuous constituents, since these can usually be recovered at the functional level: in Combien voulez-vous de pommes (lit. 'how many do you want of apples?'), both combien and de pommes have the same Object function.

Most of the difficult cases involved PP attachment or the scope of coordination, and human annotators had to spend the necessary time to fully understand the sentences. We got rid of spurious ambiguities (those with the same interpretation) by an "attach high" heuristic, for example in support verb constructions such as écrire un livre sur les indiens ('write a book about Indians'), where the PP complement passes the linguistic tests both as a complement of the verb and as a complement of the preceding noun, with no semantic difference.

2. Enrichment of the treebank

2.1. Enriching the treebank with grammatical functions

Similarly to what has been done for the German Negra and Tiger treebanks (Brants et al., 2003), we have added some functional information to the French treebank. We chose to annotate surface grammatical functions only, and to mark them as labels on the phrasal categories. For clitics, we mark the corresponding functions on the verbal nucleus. Functional information such as complement (or modifier) of a noun or complement of an adjective is already implicit in the constituent hierarchy (or in the constituent label for relative clauses). So we have concentrated on the