Metrics reloaded: Recommendations for image analysis validation

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, a large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers towards choosing metrics in a problem-aware manner. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint: a structured representation of the given problem that captures all aspects relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. Users are guided through the process of selecting and applying appropriate validation metrics while being made aware of potential pitfalls. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a common point of access for exploring the strengths and weaknesses of the most common validation metrics. An instantiation of the framework for various biological and medical image analysis use cases demonstrates its broad applicability across domains.
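To make the notion of a problem fingerprint more concrete, the following minimal Python sketch shows one way such a structured representation could be held in code. It is an illustration only and not the data model of the Metrics Reloaded framework or its online tool: the class names `ProblemCategory` and `ProblemFingerprint`, the field names, and the example properties (`small_structures`, `class_imbalance`) are assumptions introduced here. Only the four problem categories and the four groups of properties (domain interest, target structure, data set, algorithm output) are taken from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProblemCategory(Enum):
    """The four task types named in the abstract."""
    IMAGE_LEVEL_CLASSIFICATION = "image-level classification"
    OBJECT_DETECTION = "object detection"
    SEMANTIC_SEGMENTATION = "semantic segmentation"
    INSTANCE_SEGMENTATION = "instance segmentation"


@dataclass
class ProblemFingerprint:
    """Hypothetical container; all field names are illustrative assumptions."""
    category: ProblemCategory
    domain_interest: dict = field(default_factory=dict)
    target_structure: dict = field(default_factory=dict)
    data_set: dict = field(default_factory=dict)
    algorithm_output: dict = field(default_factory=dict)

    def summary(self) -> str:
        # List only the property groups that have been filled in.
        groups = ("domain_interest", "target_structure",
                  "data_set", "algorithm_output")
        filled = [name for name in groups if getattr(self, name)]
        described = ", ".join(filled) or "no properties recorded"
        return f"{self.category.value}: {described}"


# Example with made-up properties for a segmentation problem.
fingerprint = ProblemFingerprint(
    category=ProblemCategory.SEMANTIC_SEGMENTATION,
    target_structure={"small_structures": True},  # assumed property name
    data_set={"class_imbalance": True},           # assumed property name
)
print(fingerprint.summary())
```

Keeping the property groups as separate fields mirrors the way the abstract lists them; a real implementation would presumably constrain the allowed properties rather than accept free-form dictionaries.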
