New approaches to measuring the performance of programs that generate differential diagnoses using ROC curves and other metrics

INTRODUCTION
Evaluation of computer programs that generate multiple diagnoses can be hampered by a lack of effective, well-recognized performance metrics. We have developed a method to calculate mean sensitivity and specificity for multiple diagnoses and to generate ROC curves.

METHODS
Data came from a clinical evaluation of the Heart Disease Program (HDP). Sensitivity, specificity, and positive and negative predictive value (PPV, NPV) were calculated for each diagnosis type in the study. A weighted mean of overall sensitivity and specificity was derived and used to create an ROC curve. The alternative metrics Comprehensiveness and Relevance were calculated for each case and compared with the other measures.

RESULTS
Weighted mean sensitivity closely matched Comprehensiveness, and mean PPV matched Relevance. Plotting the physicians' sensitivity and specificity on the ROC curve showed that their discrimination was similar to that of the HDP, but their sensitivity was significantly lower.

CONCLUSIONS
These metrics give a clear picture of a program's diagnostic performance and allow straightforward comparison between different programs and different studies.
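
As a rough illustration of the kind of calculation the METHODS section describes, the sketch below computes per-diagnosis sensitivity, specificity, PPV, and NPV from confusion-matrix counts and then takes a weighted mean across diagnosis types. The DiagnosisCounts structure, the example counts, and the prevalence-based weighting are illustrative assumptions; the abstract does not specify the HDP study's actual weighting scheme.

from dataclasses import dataclass

@dataclass
class DiagnosisCounts:
    # Confusion-matrix counts for one diagnosis type, pooled across all cases.
    tp: int  # program listed the diagnosis and it was present
    fp: int  # program listed the diagnosis but it was absent
    fn: int  # diagnosis was present but the program missed it
    tn: int  # diagnosis was absent and the program did not list it

def sensitivity(c: DiagnosisCounts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0

def specificity(c: DiagnosisCounts) -> float:
    return c.tn / (c.tn + c.fp) if (c.tn + c.fp) else 0.0

def ppv(c: DiagnosisCounts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0

def npv(c: DiagnosisCounts) -> float:
    return c.tn / (c.tn + c.fn) if (c.tn + c.fn) else 0.0

def weighted_mean(values, weights):
    # Mean weighted per diagnosis type; weighting by prevalence is an
    # assumption made for this sketch, not the study's documented method.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical counts for two diagnosis types.
counts = [
    DiagnosisCounts(tp=40, fp=10, fn=5, tn=45),
    DiagnosisCounts(tp=12, fp=8, fn=10, tn=70),
]
prevalence = [c.tp + c.fn for c in counts]  # cases in which each diagnosis was present

mean_sens = weighted_mean([sensitivity(c) for c in counts], prevalence)
mean_spec = weighted_mean([specificity(c) for c in counts], prevalence)
print(f"weighted mean sensitivity = {mean_sens:.2f}, specificity = {mean_spec:.2f}")

Repeating this calculation at several program score thresholds yields a set of (1 - specificity, sensitivity) points; joining them traces an ROC curve on which an individual clinician's operating point can also be plotted, as described in the RESULTS paragraph above.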