A K-Means Approach to Clustering Disease Progressions

The k-means algorithm has been a workhorse of unsupervised machine learning for decades, primarily owing to its simplicity and efficiency. The algorithm requires two key operations on the data: a distance metric to compare a pair of data objects, and a way to compute a representative (centroid) for a given set of data objects. These two requirements mean that k-means cannot be readily applied to time series data, in particular to the disease progression profiles often encountered in healthcare analysis. We present a k-means-inspired approach to clustering disease progression data. The proposed method represents each cluster as a set of weights on a set of splines fitted to the time series data, and uses goodness of fit to assign time series to clusters. We use the algorithm to group patients suffering from Chronic Kidney Disease (CKD) based on their disease progression profiles. A qualitative analysis of the representative profiles for the learnt clusters reveals that this simple approach can identify groups of patients with interesting clinical characteristics. Additionally, we show how the representative profiles can be combined with a patient's observations to obtain an accurate patient-specific profile that can be used for extrapolating into the future.
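The alternation described above (fit spline weights per cluster, then reassign each time series to the cluster that fits it best) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the linear truncated-power basis, the fixed `knots`, the use of mean squared residual as the goodness-of-fit measure, and all function names (`spline_basis`, `fit_weights`, `misfit`, `cluster_trajectories`) are hypothetical choices made for the example.

```python
import numpy as np

def spline_basis(t, knots):
    """Assumed basis: [1, t, (t - k)_+ for each knot] (linear truncated power)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t] + [np.maximum(t - k, 0.0) for k in knots]
    return np.column_stack(cols)

def fit_weights(series, knots):
    """'Centroid' step: least-squares spline weights over all pooled observations."""
    B = np.vstack([spline_basis(t, knots) for t, _ in series])
    y = np.concatenate([np.asarray(y, dtype=float) for _, y in series])
    w, *_ = np.linalg.lstsq(B, y, rcond=None)
    return w

def misfit(t, y, w, knots):
    """'Distance' step: mean squared residual of one series against a cluster curve."""
    r = np.asarray(y, dtype=float) - spline_basis(t, knots) @ w
    return float(np.mean(r ** 2))

def cluster_trajectories(series, k, knots, n_iter=20, seed=0):
    """series: list of (times, values) pairs; series may have different lengths."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(series))  # random initial assignment
    weights = np.zeros((k, len(knots) + 2))
    for _ in range(n_iter):
        # Update step: refit the spline weights of every cluster.
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if not members:  # empty cluster: refit it to one random series
                members = [series[rng.integers(len(series))]]
            weights[c] = fit_weights(members, knots)
        # Assignment step: each series joins the best-fitting cluster.
        new_labels = np.array(
            [np.argmin([misfit(t, y, w, knots) for w in weights]) for t, y in series]
        )
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels, weights
```

Because each series is scored only at its own observation times, the sketch accepts irregularly sampled trajectories of different lengths, which is the situation that makes a plain Euclidean centroid awkward for clinical time series.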
