The effect of survey measurement error on clustering algorithms

Data mining and machine learning often employ clustering techniques, which aim to separate the data into interesting groups for further analysis or interpretation (Kaufman & Rousseeuw 2005; Aggarwal & Reddy 2014). Well-known algorithms from the data mining literature include K-means, DBSCAN, PAM, Ward, and Gaussian or binomial mixture models; the latter two are known in the social science literature as latent profile and latent class analysis, respectively. Some of these algorithms (K-means, Ward, mixtures) are commonly applied to surveys, while others (DBSCAN, PAM) may be less familiar to survey researchers but can be equally useful. Surveys, however, are well known to contain measurement errors. Such errors may adversely affect clustering, for instance by producing spurious clusters or by obscuring clusters that would have been detectable without errors. To date, little work has examined the effect that survey errors may exert on commonly used clustering techniques. Furthermore, while adaptations of a few specific clustering algorithms exist to make them "error-aware" (Aggarwal 2009, Ch. 8; Aggarwal & Reddy 2014, Ch. 18), no generic methods to correct clustering techniques for such errors are available. In this paper, we present a novel method for performing error-aware clustering, that is, clustering with correction for measurement error through multiple imputation (Boeschoten et al. 2018). We investigate how clustering of a large labor force survey differs with and without this correction. Implications for the application of clustering techniques to survey data are discussed.