The 'macro F1' metric is frequently used to evaluate binary, multi-class, and multi-label classification problems. Yet we find that there exist two different formulas for computing this quantity. In this note, we show that only under rare circumstances can the two computations be considered equivalent. More specifically, one of the formulas 'rewards' classifiers that produce a skewed error-type distribution. In fact, the difference between the outcomes of the two computations can be as high as 0.5. Finally, we show that the two computations may not only diverge in their scalar results but also lead to different classifier rankings.
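To make the distinction concrete, the following is a minimal Python sketch (not taken from the paper) contrasting the two computations as they are commonly stated: averaging the per-class F1 scores versus taking the F1 of the macro-averaged precision and recall. The function names and the per-class precision/recall values are hypothetical, chosen only to illustrate the divergence.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def macro_f1_averaged(precisions, recalls):
    """Formula 1: average the per-class F1 scores."""
    return sum(f1(p, r) for p, r in zip(precisions, recalls)) / len(precisions)

def macro_f1_of_averages(precisions, recalls):
    """Formula 2: F1 of the macro-averaged precision and recall."""
    avg_p = sum(precisions) / len(precisions)
    avg_r = sum(recalls) / len(recalls)
    return f1(avg_p, avg_r)

# A skewed error-type distribution: class 0 has high precision but low
# recall, class 1 the reverse. The two formulas then diverge sharply,
# with Formula 2 scoring the skewed classifier much higher.
precisions = [1.0, 0.1]
recalls    = [0.1, 1.0]
print(macro_f1_averaged(precisions, recalls))     # ~0.18
print(macro_f1_of_averages(precisions, recalls))  # 0.55
```

Pushing the skew to the extreme (per-class precision/recall approaching 1 and 0) drives Formula 1 toward 0 while Formula 2 stays at 0.5, which matches the maximal gap of 0.5 stated in the abstract.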