A Methodology for Evaluating the Performance of Alerting and Detection Algorithms Running on Continuous Patient Data

Objectives
Clinicians in the intensive care unit (ICU) are presented with large amounts of physiological data, consisting of periodic and frequently sampled measurements, such as heart rate and invasive blood pressure, as well as aperiodic measurements, such as noninvasive blood pressure and laboratory studies. Because these data can be overwhelming, there is considerable interest in designing algorithms that help integrate and interpret them and assist ICU clinicians in detecting patients who may be deteriorating, or in predicting such deterioration in advance. Before deciding whether to deploy such algorithms in a clinical trial, it is important to evaluate them using retrospective data. However, the fact that these algorithms will be running continuously, i.e., repeatedly sampling incoming patient data, presents some novel challenges for algorithm evaluation. Commonly used measures of performance, such as sensitivity and positive predictive value (PPV), are easily applied to static "snapshots" of patient data, but can be very misleading when applied to indicators or alerting algorithms running on continuous data. Our objective is to create a method for evaluating algorithm performance on retrospective data with the algorithm running continuously throughout the patient's stay, as it would in a real ICU.

Methods
We introduce our evaluation methodology in the context of evaluating an algorithm, a Hemodynamic Instability Indicator (HII), intended to assist bedside ICU clinicians with the early detection of hemodynamic instability before the onset of acute hypotension. Each patient's ICU stay is divided into segments that are labeled as hemodynamically stable or unstable based on clinician interventions typically aimed at treating hemodynamic instability. These segments can be of varying length, with varying degrees of exposure to potential alerts, whether true positive or false positive. Furthermore, to simulate how clinicians might interact with the alerting algorithm, we use a dynamic alert supervision mechanism that suppresses subsequent alerts unless the indicator has significantly deteriorated since the prior alert. Under these conditions, determining what counts as a positive or negative instance, and calculating sensitivity, specificity, and positive predictive value, can be problematic. We introduce a methodology for consistently counting positive and negative instances. The methodology distinguishes between counts based on alerting events and counts based on sub-segments, and we show how these counts can be applied in calculating measures of performance such as sensitivity, specificity, and positive predictive value.

Results
The introduced methodology is applied to the retrospective evaluation of two algorithms: HII and an alerting algorithm based on systolic blood pressure. We evaluate the algorithms using a database consisting of data from 41,707 patients from 25 US hospitals. Both algorithms are evaluated running continuously throughout each patient's stay, as they would in a real ICU setting. We show how the introduced performance measures differ across algorithms and under different assumptions.
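To make the counting schemes concrete, the following minimal Python sketch shows one way segment labels, a deterioration-based suppression rule, and the event-based versus sub-segment-based tallies could fit together. The data layout, the function names (`Segment`, `suppress`, `count_instances`), and the specific suppression threshold are illustrative assumptions, not the exact definitions used in the study.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    start: float    # segment start, e.g. hours from ICU admission
    end: float      # segment end
    unstable: bool  # label derived from clinician interventions


def suppress(alerts: List[Tuple[float, float]],
             min_deterioration: float = 0.1) -> List[Tuple[float, float]]:
    """Re-issue an alert only if the indicator has worsened by at least
    `min_deterioration` since the previously issued alert (a stand-in for
    a dynamic alert supervision mechanism; the threshold is illustrative).
    `alerts` is a list of (time, indicator value), higher value = worse."""
    issued: List[Tuple[float, float]] = []
    last_value = None
    for t, value in alerts:
        if last_value is None or value >= last_value + min_deterioration:
            issued.append((t, value))
            last_value = value
    return issued


def count_instances(segments: List[Segment],
                    alert_times: List[float]) -> Dict[str, Dict[str, int]]:
    """Event-based tallies count every issued alert; sub-segment tallies
    count each segment once, based on whether it contains any alert."""
    event = {"TP": 0, "FP": 0}
    seg = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for s in segments:
        hits = [t for t in alert_times if s.start <= t < s.end]
        if s.unstable:
            event["TP"] += len(hits)
            seg["TP" if hits else "FN"] += 1
        else:
            event["FP"] += len(hits)
            seg["FP" if hits else "TN"] += 1
    return {"event": event, "segment": seg}


# Toy example: one unstable segment surrounded by stable ones.
segments = [Segment(0, 6, False), Segment(6, 10, True), Segment(10, 24, False)]
raw_alerts = [(5.5, 0.42), (6.2, 0.55), (6.4, 0.57), (7.0, 0.70)]
issued = suppress(raw_alerts)
print(count_instances(segments, [t for t, _ in issued]))
# -> {'event': {'TP': 2, 'FP': 1}, 'segment': {'TP': 1, 'FN': 0, 'FP': 1, 'TN': 1}}
```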
Discussion
The standard measures of diagnostic tests in terms of true positives, false positives, etc., are based on certain assumptions that may not apply when they are used to measure the performance of an algorithm running continuously, and thus repeatedly sampling data from the same patient. When such measures are reported, it is important that the underlying assumptions be made explicit; otherwise, the results can be very misleading.

Conclusion
We introduce a methodology for evaluating how an alerting algorithm or indicator will perform running continuously throughout every patient's ICU stay, rather than only for a subset of patients or for selected episodes.
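For reference, the performance measures discussed here have their standard definitions; the contribution of the methodology lies in how the true-positive, false-positive, true-negative, and false-negative counts entering these formulas are defined when an algorithm runs continuously:

\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{PPV} = \frac{TP}{TP + FP}
\]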