An architecture for component-based design of representative-based clustering algorithms

We propose an architecture for designing representative-based clustering algorithms from reusable components. The components were derived from K-means-like algorithms and their extensions. With the proposed architecture, it is possible not only to reconstruct popular algorithms but also to build new ones by exchanging components taken from the original algorithms and their improvements. This makes it possible to design a large number of representative-based clustering algorithms and to compare and evaluate them fairly. In addition to describing the architecture, we demonstrate the usefulness of the approach through an experimental evaluation.
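
To make the idea of exchangeable components concrete, the following is a minimal Python sketch, not the architecture described in the paper. It assumes three hypothetical component slots (initialization, distance measure, representative update) plugged into a generic K-means-style loop. All function names and the slot decomposition are illustrative assumptions.

```python
import random
import statistics

# --- Hypothetical "initialize representatives" components ---
def init_first_k(data, k, rng):
    """Use the first k points as initial representatives (deterministic)."""
    return [list(p) for p in data[:k]]

def init_random(data, k, rng):
    """Pick k distinct points at random as initial representatives."""
    return [list(p) for p in rng.sample(list(data), k)]

# --- Hypothetical "measure distance" components ---
def dist_sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dist_manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# --- Hypothetical "update representatives" components ---
def update_mean(members):
    """Coordinate-wise mean, as in classic K-means."""
    return [sum(col) / len(members) for col in zip(*members)]

def update_median(members):
    """Coordinate-wise median, as in K-medians."""
    return [statistics.median(col) for col in zip(*members)]

def cluster(data, k, init, dist, update, max_iter=100, seed=0):
    """Generic representative-based loop; behavior depends on the components."""
    rng = random.Random(seed)
    reps = init(data, k, rng)
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest representative.
        new_labels = [min(range(k), key=lambda j: dist(p, reps[j])) for p in data]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Recompute each representative from its members (skip empty clusters).
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                reps[j] = update(members)
    return labels, reps

data = [(0, 0), (0, 1), (10, 10), (10, 11)]
# K-means-like configuration.
kmeans_labels, kmeans_reps = cluster(data, 2, init_first_k, dist_sq_euclidean, update_mean)
# K-medians-like configuration obtained by swapping two components.
kmed_labels, kmed_reps = cluster(data, 2, init_first_k, dist_manhattan, update_median)
```

Swapping `dist_sq_euclidean` and `update_mean` for `dist_manhattan` and `update_median` turns the same loop from a K-means-like into a K-medians-like algorithm. This is the kind of recombination the abstract refers to, shown here under the simplifying assumptions above.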
