The effect of measurement error on clustering algorithms

Clustering is a popular family of techniques for separating data into interesting groups for further analysis. Many data sources on which clustering is performed are well known to contain random and systematic measurement errors, and such errors may adversely affect the resulting clusters. While several techniques have been developed to deal with this problem, little is known about their effectiveness. Moreover, no work to date has examined the effect of systematic errors on clustering solutions. In this paper, we perform a Monte Carlo study to investigate the sensitivity of two common clustering algorithms, Gaussian mixture models (GMMs) with merging and DBSCAN, to random and systematic error. We find that measurement error is particularly problematic when it is systematic and when it affects all variables in the dataset. For the conditions considered here, we also find that the partition-based GMM with merged components is less sensitive to measurement error than the density-based DBSCAN procedure.
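
The kind of Monte Carlo sensitivity check described above can be sketched as follows. This is a minimal illustration only, not the simulation design used in the paper: the two-cluster data-generating process, the error models (independent Gaussian noise for random error, a constant shift applied to a random fraction of observations on all variables for systematic error), the parameter values, the plain-Python DBSCAN, and every function name are assumptions made for the example. Agreement with the true partition is measured with the unadjusted Rand index, and DBSCAN noise points are simply treated as one extra group.

```python
import math
import random


def make_clusters(rng, n_per_cluster=40):
    """Two well-separated 2-D Gaussian clusters with known labels (assumed design)."""
    centers = [(0.0, 0.0), (6.0, 6.0)]
    points, labels = [], []
    for k, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
            labels.append(k)
    return points, labels


def add_random_error(points, rng, sd):
    """Random error: independent zero-mean Gaussian noise on every variable."""
    return [(x + rng.gauss(0.0, sd), y + rng.gauss(0.0, sd)) for x, y in points]


def add_systematic_error(points, rng, shift, prob):
    """Systematic error (assumed form): a constant shift on all variables
    for a randomly chosen fraction `prob` of the observations."""
    return [(x + shift, y + shift) if rng.random() < prob else (x, y)
            for x, y in points]


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns cluster ids, with -1 marking noise."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # core point: keep expanding
                queue.extend(nb_j)
    return labels


def rand_index(a, b):
    """Share of observation pairs on which two partitions agree."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)


def monte_carlo(n_reps=20, seed=0):
    """Average Rand index of DBSCAN against the truth per error condition."""
    rng = random.Random(seed)
    totals = {"none": 0.0, "random": 0.0, "systematic": 0.0}
    for _ in range(n_reps):
        points, truth = make_clusters(rng)
        conditions = {
            "none": points,
            "random": add_random_error(points, rng, sd=1.0),
            "systematic": add_systematic_error(points, rng, shift=3.0, prob=0.3),
        }
        for name, data in conditions.items():
            found = dbscan(data, eps=1.0, min_pts=4)
            totals[name] += rand_index(truth, found)
    return {name: total / n_reps for name, total in totals.items()}


if __name__ == "__main__":
    for condition, score in monte_carlo().items():
        print(f"{condition:>10}: mean Rand index = {score:.3f}")
```

Running the script prints one mean Rand index per error condition. The values depend entirely on the assumed settings, so they should be read as an illustration of the procedure rather than as a reproduction of the reported findings. A GMM with merged components could be compared in the same loop by swapping in a different clustering function.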
