Tagger Evaluation Given Hierarchical Tag Sets

We present methods for evaluating human and automatic taggers that extend current practice in three ways. First, we show how to evaluate taggers that assign multiple tags to each test instance, even if they do not assign probabilities. Second, we show how to accommodate a common property of manually constructed ``gold standards'' that are typically used for objective evaluation, namely that there is often more than one correct answer. Third, we show how to measure performance when the set of possible tags is tree-structured in an IS-A hierarchy. To illustrate how our methods can be used to measure inter-annotator agreement, we show how to compute the kappa coefficient over hierarchical tag sets.
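
For reference, the standard kappa coefficient over a flat tag set, which the final point generalizes to hierarchical tag sets (the hierarchical formulation itself is developed later in the paper and is not reproduced here), is
\[
\kappa \;=\; \frac{P(A) - P(E)}{1 - P(E)},
\]
where $P(A)$ is the observed proportion of agreement between annotators and $P(E)$ is the proportion of agreement expected by chance.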