A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms

Most correlation clustering algorithms rely on principal component analysis (PCA) as a correlation analysis tool: the correlation of each cluster is learned by applying PCA to a set of sample points. Because PCA is sensitive to outliers, even a small fraction of sample points that do not follow the correct correlation of the cluster can mislead these algorithms or cause them to miss the correct result entirely. In this paper, we evaluate the influence of outliers on PCA and propose a general framework that increases the robustness of PCA so that the correct correlation of each cluster can be determined. We further show how the framework can be applied to PCA-based correlation clustering algorithms. A thorough experimental evaluation shows the benefit of our framework on several synthetic and real-world data sets.
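The outlier sensitivity described above can be illustrated with a small numerical sketch. It is not the framework proposed in the paper. It assumes NumPy, a synthetic 2-D data set, and a generic distance-based weighting scheme (Gaussian weights around the coordinate-wise median) chosen only for illustration. The helper names `principal_direction`, `angle_deg`, and `distance_weights` are hypothetical.

```python
import numpy as np


def principal_direction(points, weights=None):
    """Unit eigenvector with the largest eigenvalue of the (weighted) covariance."""
    points = np.asarray(points, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    weights = np.asarray(weights, dtype=float)
    mean = np.average(points, axis=0, weights=weights)
    centered = points - mean
    cov = (centered * weights[:, None]).T @ centered / weights.sum()
    # eigh returns eigenvalues in ascending order, so the last column
    # belongs to the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]


def angle_deg(u, v):
    """Angle between two directions in degrees, ignoring sign."""
    cos = abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))


def distance_weights(points):
    """Illustrative weighting (an assumption, not the paper's scheme):
    Gaussian weights that decay with distance from the coordinate-wise median."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.median(points, axis=0), axis=1)
    scale = np.median(dist)
    return np.exp(-0.5 * (dist / scale) ** 2)


rng = np.random.default_rng(0)
true_dir = np.array([1.0, 1.0]) / np.sqrt(2.0)

# 100 points close to the line y = x (the "correct correlation").
t = rng.uniform(-1.0, 1.0, size=100)
inliers = np.outer(t, true_dir) + rng.normal(scale=0.01, size=(100, 2))

# 5 far-away points along the orthogonal direction (the outliers).
outliers = np.array(
    [[-5.0, 5.0], [5.0, -5.0], [-4.0, 4.0], [4.0, -4.0], [6.0, -6.0]]
)
points = np.vstack([inliers, outliers])

angle_plain = angle_deg(principal_direction(points), true_dir)
angle_robust = angle_deg(
    principal_direction(points, distance_weights(points)), true_dir
)

print(f"plain PCA deviates by    {angle_plain:.1f} degrees")
print(f"weighted PCA deviates by {angle_robust:.1f} degrees")
```

In this setup, fewer than 5% of the points are outliers, yet they dominate the unweighted covariance and turn the first principal component away from the line the inliers follow. Down-weighting distant points is one simple way to reduce that effect; other robust estimators would serve the same illustrative purpose.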
