From plastic to gold: a unified classification scheme for reference standards in medical image processing

Reliable evaluation of medical image processing is of major importance for routine applications. Nonetheless, evaluation is often omitted or methodologically flawed when novel approaches or algorithms are introduced. Adopting concepts from medical diagnosis, we define the following criteria to classify reference standards: 1. Reliance, if the generation or capturing of test images for evaluation follows an exactly determined and reproducible protocol. 2. Equivalence, if the image material or relationships considered within an algorithmic reference standard equal real-life data with respect to structure, noise, or other parameters of importance. 3. Independence, if the reference standard relies on a different procedure than the one to be evaluated, or on other images or image modalities than those used routinely. This criterion bans the use of the same image in both the training and the test phase. 4. Relevance, if the algorithm to be evaluated is self-reproducible. If random parameters or optimization strategies are applied, the reliability of the algorithm must be shown before the reference standard is applied for evaluation. 5. Significance, if the number of reference standard images used for evaluation is sufficiently large to enable statistically sound analysis. We demand that a true gold standard satisfy Criteria 1 to 3. Any standard satisfying only two of these criteria, i.e., Criterion 1 and Criterion 2 or Criterion 1 and Criterion 3, is referred to as a silver standard. All other standards are termed plastic standards. Before exhaustive evaluation based on a gold or silver standard is performed, the relevance of the algorithm must be shown (Criterion 4) and enough tests must be carried out to support statistical analysis (Criterion 5). In this paper, examples are given for each class of reference standards.
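
The classification rule stated above can be written as a small decision procedure. The following Python sketch is an illustration only, not part of the original scheme: the names `ReferenceStandard`, `classify`, and `ready_for_evaluation` are hypothetical, and each criterion is assumed to have already been judged as simply satisfied or not. Following the wording above, a standard that meets Criteria 2 and 3 but not Criterion 1 is treated as plastic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceStandard:
    """Hypothetical container for the three classifying criteria."""
    reliance: bool      # Criterion 1: reproducible image generation protocol
    equivalence: bool   # Criterion 2: material equals real-life data
    independence: bool  # Criterion 3: independent procedure or images


def classify(std: ReferenceStandard) -> str:
    """Return 'gold', 'silver', or 'plastic' for a reference standard."""
    if std.reliance and std.equivalence and std.independence:
        return "gold"
    # Silver: Criterion 1 together with exactly one of Criteria 2 and 3.
    if std.reliance and (std.equivalence or std.independence):
        return "silver"
    return "plastic"


def ready_for_evaluation(std: ReferenceStandard,
                         relevance: bool,
                         significance: bool) -> bool:
    """Check the preconditions for exhaustive evaluation.

    Assumes relevance (Criterion 4) and significance (Criterion 5)
    have been established separately and are passed in as booleans.
    """
    return classify(std) in ("gold", "silver") and relevance and significance
```

For example, `classify(ReferenceStandard(True, False, True))` yields a silver standard, whereas a standard lacking a reproducible protocol is plastic regardless of the other two criteria.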