Use of cluster separation indices and the influence of outliers: application of two new separation indices, the modified silhouette index and the overlap coefficient to simulated data and mouse urine metabolomic profiles

To quantify the separation between classes, four indices are compared: the Davies Bouldin index, the silhouette width, and two new approaches described in this paper, namely the modified silhouette width index, based on the proportion of objects with a positive silhouette width, and the Overlap Coefficient. Four groups of simulated datasets are described, each consisting of 15 datasets with varying degrees of overlap, and the groups differ in the nature of their outliers. Three experimental datasets are also described, consisting of gas chromatography mass spectrometry of extracts from mouse urine, obtained to study the effect of environmental (stress), physiological (diet) and developmental (age) factors on metabolic profiles. The paper discusses the robustness of each index to outliers and how well each index allows class separation to be assessed. The two new indices protect against the influence of outliers. Copyright © 2008 John Wiley & Sons, Ltd.
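
The abstract describes the modified silhouette index only as the proportion of objects with a positive silhouette width. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the standard silhouette width of Rousseeuw with Euclidean distances, and the function names (`silhouette_widths`, `positive_silhouette_fraction`) are hypothetical. The Overlap Coefficient is not sketched because the abstract does not define it.

```python
import numpy as np

def silhouette_widths(X, labels):
    """Per-object silhouette width s(i) = (b - a) / max(a, b).

    Assumption: Euclidean distance; a is the mean distance to the other
    members of the object's own class, b is the smallest mean distance
    to the members of any other class.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))   # pairwise distance matrix
    classes = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own < 2:
            continue                       # singleton class: s(i) = 0 by convention
        a = D[i, own].sum() / (n_own - 1)  # D[i, i] = 0, so exclude self via n_own - 1
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s

def mean_silhouette(X, labels):
    """Conventional summary: average silhouette width over all objects."""
    return float(silhouette_widths(X, labels).mean())

def positive_silhouette_fraction(X, labels):
    """Sketch of the modified index: fraction of objects with s(i) > 0."""
    return float((silhouette_widths(X, labels) > 0).mean())

# Two well-separated classes plus one class-0 object lying inside class 1.
X = [[0, 0], [0, 1], [1, 0], [10.5, 10.5],
     [10, 10], [10, 11], [11, 10]]
labels = [0, 0, 0, 0, 1, 1, 1]
frac = positive_silhouette_fraction(X, labels)
avg = mean_silhouette(X, labels)
```

In this toy example a single misplaced object drags down the mean silhouette width through its strongly negative value and through the inflated within-class distances of its classmates, whereas the proportion-based summary only counts it as one object out of seven.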
