A review of conceptual clustering algorithms

Clustering is a fundamental technique in data mining and pattern recognition that has been successfully applied in several contexts. However, most clustering algorithms developed so far focus only on organizing a collection of objects into a set of clusters, leaving the interpretation of those clusters to the user. Conceptual clustering algorithms, in addition to listing the objects that belong to each cluster, provide one or several concepts per cluster as an explanation of that cluster. In this work, we present an overview of the most influential algorithms reported in the field of conceptual clustering, highlighting their limitations or drawbacks. Additionally, we present a taxonomy of these methods and a qualitative comparison of the algorithms with respect to a set of characteristics that are desirable from a practical point of view, which may help in selecting the most appropriate method for a problem at hand. Finally, we discuss some research lines that need further development in the context of conceptual clustering.
