Towards a methodology for evaluating alignment and matching algorithms
Version 1.0
Ontology Alignment Evaluation Initiative

This document considers potential strategies for experimentally evaluating ontology alignment algorithms. It first identifies various goals for such an evaluation, the most important objective being the improvement of existing methods. It then considers the various parameters of the alignment task that must be controlled during the experiments and examines the measures that can be used for evaluation. Finally, it proposes a framework for organising the evaluation, based on principles and efforts that have already been undertaken in the specific field of ontology alignment.

Executive Summary

Some of the heterogeneity problems on the semantic web can be solved by aligning, or matching, heterogeneous ontologies. Aligning ontologies consists of finding the corresponding entities in these ontologies. Many techniques are available for achieving ontology alignment and many systems have been developed based on these techniques. However, little comparison and little integration of these implementations has actually been carried out. The present report studies what kind of evaluation can be performed on alignment algorithms in order to help the worldwide research community improve on the current techniques. It should be considered as a white paper describing what the Ontology Alignment Evaluation Initiative is supposed to be.

In this document, we first examine the purpose and types of evaluation as well as established evaluation methodology (§1). We found that two different kinds of benchmarks are worth developing for ontology alignment: competence benchmarks, based on many "unit tests", each of which characterises a particular situation and assesses the capabilities of an algorithm; and performance benchmarks, based on challenging "real-world" situations in which algorithms compete. We have examined the possible variations of the ontology alignment problem (§2) and the possible measures that can be used for evaluating alignment results (§3). This allows us to specify the profile of the benchmarks to be performed and how results will be evaluated. The space of possible variations is very large, so we had to restrict the considered task drastically (at least for competence benchmarks). These restrictions could be relaxed in further evaluations or when evaluating algorithms on a particular, clearly identified subtask. Concerning the evaluation measure, precision and recall are, so far, the best understood measures (a sketch illustrating them appears below); however, it will be very important in the future to also involve resource consumption measures. Then we draw on previous experiments in order to design some guidelines for performing an evaluation campaign. This involves …
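To make the precision and recall measures mentioned above concrete, the following is a minimal sketch (in Python, with illustrative names that are not part of any OAEI tooling) that treats an alignment as a set of correspondences and compares a computed alignment against a reference alignment.

```python
# Minimal sketch of alignment evaluation, assuming an alignment is a set of
# correspondences (entity in ontology 1, entity in ontology 2, relation).
# Names and data are illustrative only.

def precision_recall(found, reference):
    """Return (precision, recall) of a computed alignment w.r.t. a reference.

    Both arguments are sets of correspondence tuples.
    """
    correct = found & reference                      # correspondences both found and expected
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Toy example: aligning two small ontologies o1 and o2.
reference = {("o1#Book", "o2#Volume", "="), ("o1#Author", "o2#Writer", "=")}
found = {("o1#Book", "o2#Volume", "="), ("o1#Book", "o2#Writer", "=")}
print(precision_recall(found, reference))            # -> (0.5, 0.5)
```

In this reading, precision is the share of returned correspondences that are correct, and recall is the share of the reference correspondences that were found; resource consumption (time, memory) would have to be measured separately, outside such a comparison.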