Exploratory data analysis of a clinical study group: Development of a procedure for exploring multidimensional data

Thorough knowledge of the structure of analyzed data allows to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Due to multitude of available methods, selecting those which will work together well and facilitate data interpretation is not an easy task. In this work we present a well fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward’s algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients that participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex hormone attributes. Further analysis was carried out separately for male and female patients. The most optimal partitioning in the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male dataset. No evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD allows not only to identify outliers, but can also assess the heterogeneity of a dataset. The case study proved that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.

[1]  Fionn Murtagh,et al.  Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion? , 2011, Journal of Classification.

[2]  J. Chudek,et al.  Interrelation between genotypes of the vitamin D receptor gene and serum sex hormone concentrations in the Polish elderly population: the PolSenior study , 2014, Experimental Gerontology.

[3]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[4]  Ji Zhang,et al.  Advancements of Outlier Detection: A Survey , 2013, EAI Endorsed Trans. Scalable Inf. Syst..

[5]  Hans-Peter Kriegel,et al.  Angle-based outlier detection in high-dimensional data , 2008, KDD.

[6]  Victoria J. Hodge,et al.  A Survey of Outlier Detection Methodologies , 2004, Artificial Intelligence Review.

[7]  L. Lazzeroni Plaid models for gene expression data , 2000 .

[8]  Jerzy Chudek,et al.  Medical, psychological and socioeconomic aspects of aging in Poland Assumptions and objectives of the PolSenior project , 2011, Experimental Gerontology.

[9]  Pierre Comon,et al.  Independent component analysis, A new concept? , 1994, Signal Process..

[10]  Katrien van Driessen,et al.  A Fast Algorithm for the Minimum Covariance Determinant Estimator , 1999, Technometrics.

[11]  W. Kruskal,et al.  Use of Ranks in One-Criterion Variance Analysis , 1952 .

[12]  F. Hampel The Influence Curve and Its Role in Robust Estimation , 1974 .

[13]  A. Seftel Male hypogonadism. Part II: etiology, pathophysiology, and diagnosis , 2006, International Journal of Impotence Research.

[14]  J. C. Dunn,et al.  A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters , 1973 .

[15]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[17]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[18]  Pavel Berkhin,et al.  A Survey of Clustering Data Mining Techniques , 2006, Grouping Multidimensional Data.

[19]  A. Madansky Identification of Outliers , 1988 .

[20]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[21]  Lior Rokach,et al.  Data Mining And Knowledge Discovery Handbook , 2005 .

[22]  Raymond T. Ng,et al.  Distance-based outliers: algorithms and applications , 2000, The VLDB Journal.

[23]  K. Gabriel,et al.  The biplot graphic display of matrices with application to principal component analysis , 1971 .