A Cognitively Inspired Approach to Two-Way Cluster Extraction from One-Way Clustered Data

Cluster extraction is a vital part of data mining; however, humans and computers perform it very differently. Humans tend to estimate, perceive, or visualize clusters cognitively, whereas digital computers perform an exact extraction, follow a fuzzy approach, or organize the clusters in a hierarchical tree. In real data sets, clusters not only differ in density but also contain embedded noise and are nested, which makes their extraction more challenging. In this paper, we propose a density-based technique for extracting connected rectangular clusters that may go undetected by traditional cluster extraction techniques. The proposed technique is inspired by the human cognitive approach of appropriately scaling the level of detail: it moves from a low level of detail (one-way clustering) to a high level of detail (biclustering) in the dimension of interest, as in online analytical processing. Experiments were performed on simulated and real data sets, and the proposed technique was compared with four popular cluster extraction techniques (DBSCAN, CLIQUE, k-medoids, and k-means), with promising results.
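
The move from one-way clustering to biclustering can be illustrated with a small sketch. It is not the algorithm proposed in the paper, only a minimal illustration under stated assumptions: the data is a binary matrix, a row clustering is already available, and a column joins a row cluster's block when the fraction of ones in that column reaches a threshold. The function name `refine_to_biclusters` and the parameter `min_density` are hypothetical.

```python
def refine_to_biclusters(matrix, row_labels, min_density=0.7):
    """Refine a one-way (row) clustering into (rows, columns) blocks.

    Illustrative assumption: a column belongs to a row cluster's block
    when at least `min_density` of that cluster's rows have a 1 in it.
    """
    # Group row indices by their one-way cluster label.
    groups = {}
    for row_index, label in enumerate(row_labels):
        groups.setdefault(label, []).append(row_index)

    biclusters = []
    for label in sorted(groups):
        rows = groups[label]
        cols = []
        for col in range(len(matrix[0])):
            # Density of ones in this column, restricted to the cluster's rows.
            density = sum(matrix[r][col] for r in rows) / len(rows)
            if density >= min_density:
                cols.append(col)
        if cols:
            biclusters.append((rows, cols))
    return biclusters


# Two dense rectangular blocks with a little embedded noise.
data = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],  # the trailing 1 is noise
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],  # the 1 in column 1 is noise
]
row_labels = [0, 0, 0, 1, 1, 1]  # assumed output of a one-way clustering step

result = refine_to_biclusters(data, row_labels)
print(result)
```

With the threshold at 0.7, the isolated noisy entries fall below the density cutoff, so each row cluster keeps only the columns in which it is dense. A lower threshold would admit more columns at the cost of absorbing noise.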
