Estimation of missing values in clinical laboratory measurements of ICU patients using a weighted K-nearest neighbors algorithm

In the modern intensive care unit (ICU), the physiologic state of critically ill patients is monitored through a diverse array of biosensors and laboratory measurements. The sheer volume of collected data has overwhelmed the clinicians charged with assimilating it and transforming it into clinical hypotheses. Automated algorithms with vigilant monitoring and clinical decision-support capabilities would help to alleviate this "information overload" challenge. Inherent noise and measurement error further complicate the real-time analysis and interpretation of medical data. One class of "noise" in medical data is the absence or unavailability of a desired measurement. We analyzed a large collection of clinical laboratory data (blood chemistry, blood gases, complete blood counts) from over 600 ICU/CCU patients in the MIMIC II database and determined, for each measurement, the frequency of missing values across patient records. Furthermore, we developed a novel method that estimates missing values with a weighted K-nearest neighbors algorithm. We propose a weighting scheme that exploits the correlation between a "missing" dimension and the available data values from other fields. We compare our technique with several popular missing value estimation techniques: principal components analysis, least squares estimation, mean imputation, and classical K-nearest neighbors. The mean standardized imputation error ranges from a minimum of 0.31 to a maximum of 0.75, depending on the imputed dimension. The mean standardized imputation error over all dimensions is 0.45.
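
The idea of a correlation-weighted K-nearest neighbors estimate can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so everything specific here is an assumption made for illustration: the weight of each field is taken to be the absolute Pearson correlation between that field and the "missing" dimension, the distance is a weighted Euclidean distance over fields observed in both records, and the estimate is an inverse-distance-weighted average over the k nearest donor records. The function names `correlation_weights` and `weighted_knn_impute` are hypothetical, and the input matrix is assumed to hold one patient record per row, with columns standardized and missing entries stored as NaN.

```python
import numpy as np

def correlation_weights(X, target):
    """Assumed weighting: |Pearson r| between each column and the target
    column, computed over rows where both values are observed."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    for j in range(n_features):
        if j == target:
            continue  # the target dimension never weights itself
        both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, target])
        if both.sum() > 2:
            xj, xt = X[both, j], X[both, target]
            if xj.std() > 0 and xt.std() > 0:
                w[j] = abs(np.corrcoef(xj, xt)[0, 1])
    return w

def weighted_knn_impute(X, row, target, k=5):
    """Estimate X[row, target] from the k nearest rows that have it."""
    w = correlation_weights(X, target)
    # Donors are all other rows with an observed value in the target column.
    donors = np.where(~np.isnan(X[:, target]))[0]
    donors = donors[donors != row]
    dists = np.full(donors.size, np.inf)
    for i, d in enumerate(donors):
        # Compare only fields observed in both rows and carrying weight.
        shared = ~np.isnan(X[row]) & ~np.isnan(X[d]) & (w > 0)
        if shared.any():
            diff = X[row, shared] - X[d, shared]
            dists[i] = np.sqrt(np.sum(w[shared] * diff**2) / np.sum(w[shared]))
    order = np.argsort(dists)[:k]
    order = order[np.isfinite(dists[order])]
    if order.size == 0:
        # No comparable donor: fall back to mean imputation.
        return float(np.nanmean(X[:, target]))
    # Closer donors contribute more to the estimate.
    inv = 1.0 / (dists[order] + 1e-9)
    return float(np.sum(inv * X[donors[order], target]) / np.sum(inv))
```

Under these assumptions, fields that are uncorrelated with the missing dimension contribute little to the distance, so neighbors are selected mainly on the fields that carry information about the value being estimated. Setting all weights to 1 recovers the classical K-nearest neighbors baseline.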