The GRACE evaluation program aims at applying the Evaluation Paradigm to the evaluation of Part-of-Speech taggers for French. An interesting by-product of GRACE is the production of the validated language resources necessary for the evaluation. After briefly recalling the origins and nature of the Evaluation Paradigm, we show how it relates to other national and international initiatives. We then present the GRACE evaluation campaign, now coming to an end, and describe its four main components (corpus building, tagging procedure, lexicon building, evaluation procedure), as well as its internal organization.

1. The Evaluation Paradigm

The Evaluation Paradigm has been proposed as a means to foster development in research and technology in the field of language engineering. Up to now, it has mostly been used in the United States, in the framework of the ARPA and NIST projects on the automatic processing of spoken and written language. The paradigm is based on a two-step process. First, textual or speech data are created in the form of raw corpora, tagged corpora or lexica, which are then distributed to the main actors in the field of language engineering for the development of natural language processing tools; these tools address problems such as disambiguation, natural language database query, message understanding, machine translation, dictation, dialog, character recognition, etc. Then, the systems are tested and compared on similar data. The test results, together with the discussions (within dedicated workshops, for example) triggered by their publication and comparison, constitute a good basis for assessing the pros and cons of the various methods. The resulting synergy is a driving force for the field of language engineering. The Linguistic Data Consortium (whose function is to collect language-related data and to organize their distribution) is a typical illustration of the positive consequences of programs implementing the Evaluation Paradigm.

The GRACE evaluation program is meant to be an implementation of the Evaluation Paradigm in the field of morpho-syntactic tagging. As such, it takes the form of an evaluation campaign for Part-of-Speech taggers for French, organized within an automated, quantitative, black-box evaluation framework.

2. The GRACE evaluation program

Started on the initiative of Joseph Mariani (LIMSI) and Robert Martin (INaLF), GRACE (Grammars and Resources for Analyzers of Corpora and their Evaluation) was part of the French program “Cognition, Communication Intelligente et Ingénierie des Langues” (Cognition, Intelligent Communication and Language Engineering), jointly promoted by the Engineering Sciences and Human Sciences departments of the CNRS (National Center for Scientific Research). The GRACE program was intended to run over a four-year period (1994-1997) and was planned in two phases: a first phase dedicated to Part-of-Speech taggers, and a second phase devoted to syntactic analyzers, which has since been abandoned.

The first year was devoted to setting up a coordination committee in charge of running the project and a reflection committee. The reflection committee, formed of a panel of experts from various domains, is responsible for defining the evaluation protocol, specifying the reference tagset and lexicon, deciding which data are made available to the participants, and organizing the workshop for the presentation of the final results. The third entity of the GRACE organization brings together all the participants.
They come from both public institutions and industrial corporations. Only participants with fully operational systems were allowed to take part in the evaluation. Furthermore, only participants who agreed to describe how their system works (at least during a workshop whose attendance would be restricted to the participants themselves) were authorized to take part in the workshop concluding the evaluation campaign. Twenty participants, from both academia and industry, registered at the beginning of the project. This number decreased slightly during the project: 17 took part in the dry run and 13 in the final test, the results of which will be published at the beginning of fall ’98.

3. Defining the Evaluation Procedure

For the definition and organization of the GRACE evaluation campaign, we built upon the work done in previous evaluation programs, in particular the evaluation campaigns conducted in the United States, especially within the scope of the ARPA Human Language Technology program. Namely: the MUC conferences (MUC-1, MUC-2, MUC-3 (Sundheim 1991), MUC-4 (MUC 1992)), aiming at the evaluation of message understanding systems (the task used in the MUC evaluation campaigns was for the systems to fill in predefined frames from texts relating US Navy manoeuvres (MUC-1 and MUC-2) or terrorist acts (MUC-3 and MUC-4)); TIPSTER, concerning the evaluation of systems for automated information extraction from raw text data; the TREC conferences (Harman 1993; Harman 1994), concerning the evaluation of Information Retrieval systems operating on textual databases; and ParsEval and SemEval, which find their origin in Ezra Black’s work (Black 1991; Black 1993; Black 1994) on the evaluation of syntactic analyzers, done within the scope of an ACL working group. GRACE also looked at the “Morpholympics” competition (Hausser 1994a; Hausser 1994b), which was organized in spring 1994 at Erlangen University in Germany for the evaluation of German morphological analyzers. MUC and TREC use task-oriented black-box evaluation schemes requiring no knowledge of the internal processes or theoretical underpinnings of the systems being tested, while ParsEval and SemEval (some of which will be part of MUC-6) attempt to evaluate systems at the module level by using a benchmark method based on a reference corpus annotated with syntactic structures agreed upon by a panel of experts. An additional list of evaluation methods for linguistic software (lingware) now in use in industry can be found in Marc Cavazza’s report (in French) for the French Ministry of Education and Research (Cavazza 1994). Another extensive overview of evaluation programs for Natural Language Processing systems is provided in (Sparck Jones and Galliers 1996).

Similarly to the evaluation campaigns organized in the United States, GRACE was divided into four phases:

1. training phase (“phase d’entraînement”): distribution of the training data (the training corpus) to the participants for the initial set-up of their systems;
2. dry-run phase (“phase d’essais”): distribution of a first set of data (the dry-run corpus) to the participants for a first real-size test of the evaluation protocol;
3. test phase (“phase de test”): distribution of the “real” evaluation data (the test corpus) to the participants and the realization of the evaluation itself;
4. adjudication phase (“phase d’adjudication”): validation of the evaluation results with the participants; this phase leads to the organization of a workshop where all the participants present their methods and their systems and discuss the results of the evaluation.

In line with the task-oriented approach chosen in GRACE, the evaluation procedure was based on an automated comparison, on a common corpus of literary and journalistic texts, of the PoS tags produced by the various tagging systems against PoS tags manually validated by an expert (tagging is therefore the task selected for evaluation). In addition, as the evaluation procedure has to be applicable to the simultaneous evaluation of several systems (which may well use different tagging techniques: statistical, rule-based, ...), the definition of the evaluation metrics cannot rely on any presupposition about the internal characteristics of the tagging methods. It therefore has to be defined exclusively in terms of the outputs produced by the systems (a pure “black-box” approach), which, in the case of tagging, can be minimally represented as sequences of pairs consisting of a word token and its tag (or tag list). Such an output is considered “minimal” for the tagging task because several taggers also produce additional information beyond the simple tags (e.g. lemmas). In GRACE, we decided not to take such additional information into account (for example, no evaluation of any lemmatization provided by the systems was performed) and to restrict ourselves to the tagging task, defined as associating one unique tag with each token (and not, for instance, a partially disambiguated list of tags, which would have required much more complex metrics for comparing the systems).

However, even with such a minimalistic definition of the tagging task, the actual definition of a working evaluation metric required the GRACE steering committee to take several decisions about crucial issues: how to compare systems that do not operate on the same tokens (i.e. that use different segmentation procedures)? How to take into account the processing of compound forms? How to compare systems that do not use the same tagsets? How to weight, in the evaluation, the different components that make up any realistic tagging system? In particular, how to evaluate the influence of a tagger’s capacity to handle unknown words? How to evaluate the influence of the quality of the lexical information available?

Built upon the evaluation scheme initially proposed by Martin Rajman in (Adda et al. 1995) and then adapted and extended by the GRACE committees, the evaluation procedure used in GRACE is characterized by the following aspects:

Dealing with varying tokenizations

The problem of variations in text segmentation between the hand-tagged reference material and the text returned by the participants is a central issue for tagger evaluation. Indeed, so as not to leave the participants complete freedom as to the tokenizing algorithm (and the lexicon) used to segment the data, they had to
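To give a concrete picture of the black-box comparison described above, the following Python sketch scores a tagger’s output against a hand-validated reference when both are represented, as in the minimal output format mentioned earlier, as sequences of (token, tag) pairs. The character-span alignment used to cope with diverging tokenizations, the tagset-mapping dictionary and all identifiers are illustrative assumptions introduced here for exposition only; they do not reproduce the actual GRACE measurement procedures.

# Minimal sketch (not the GRACE software) of a black-box tagger comparison.
# Both sides are sequences of (token, tag) pairs, as described in the text.

def char_spans(tokens):
    """Assign each token a (start, end) span in the whitespace-free
    concatenation of the tokens, so two tokenizations can be aligned."""
    spans, offset = [], 0
    for tok, tag in tokens:
        spans.append(((offset, offset + len(tok)), tok, tag))
        offset += len(tok)
    return spans

def evaluate(hypothesis, reference, tag_map=None):
    """Score only tokens whose spans coincide in both segmentations,
    after projecting the hypothesis tags onto the reference tagset."""
    tag_map = tag_map or {}
    ref_by_span = {span: tag for span, tok, tag in char_spans(reference)}
    scored = correct = 0
    for span, tok, tag in char_spans(hypothesis):
        if span not in ref_by_span:
            continue  # tokenization mismatch: token left out of the score
        scored += 1
        if tag_map.get(tag, tag) == ref_by_span[span]:
            correct += 1
    coverage = scored / len(reference) if reference else 0.0
    accuracy = correct / scored if scored else 0.0
    return accuracy, coverage

# Hypothetical example: the system segments "aujourd'hui" differently
# from the reference and uses its own tagset.
reference = [("Le", "DET"), ("chat", "NOUN"), ("dort", "VERB"), ("aujourd'hui", "ADV")]
hypothesis = [("Le", "D"), ("chat", "N"), ("dort", "V"), ("aujourd'", "ADV"), ("hui", "ADV")]
print(evaluate(hypothesis, reference, tag_map={"D": "DET", "N": "NOUN", "V": "VERB"}))

In this sketch, tokens whose spans do not coincide in the two segmentations are simply excluded from the accuracy and reported through a separate coverage figure; this is only one of several possible ways of handling the segmentation and tagset issues raised above.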
References

[1] Donna Harman, et al. The First Text REtrieval Conference (TREC-1), 1993.
[2] Simone Teufel. A Support Tool for Tagset Mapping, 1995. ArXiv.
[3] Patrick Paroubek, et al. Les procédures de mesure automatique de l'action GRACE pour l'évaluation des assignateurs de Parties du Discours pour le Français (The automatic measurement procedures of the GRACE action for the evaluation of Part-of-Speech taggers for French), 1997.
[4] Margaret King, et al. Evaluating natural language processing systems, 1996. CACM.
[5] Roland Hausser. The Coordinator's Final Report on the first Morpholympics, 1994. LDV Forum.
[6] Laurent Romary, et al. Vers une normalisation des ressources linguistiques : le serveur SILFIDE (Towards a standardization of language resources: the SILFIDE server), 2000.
[7] Max Silberztein, et al. Dictionnaires électroniques et analyse automatique de textes : le système INTEX (Electronic dictionaries and automatic text analysis: the INTEX system), 1993.
[8] Donna Harman. The Second Text Retrieval Conference (TREC-2), 1994. NIST.
[9] Ralph Grishman, et al. A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars, 1991. HLT.
[10] Nancy Ide, et al. MULTEXT: Multilingual Text Tools and Corpora, 1994. COLING.
[11] Donna Harman, et al. The Second Text Retrieval Conference (TREC-2), 1995. Inf. Process. Manag.
[12] Beth Sundheim. Third Message Understanding Evaluation and Conference (MUC-3): Phase 1 Status Report, 1991. HLT.
[13] Margaret King, et al. Evaluation of natural language processing systems, 1991.
[14] Julia Galliers, et al. Evaluating natural language processing systems, 1995.