A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem, and a linear-complexity algorithm is presented that provides the globally optimal solution to this problem in both semi-supervised and unsupervised scenarios. We report a collection of experiments involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.
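
To make the idea of a "local cut" concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes, purely for illustration, that every cluster in the tree carries a non-negative quality score, that the objective is the sum of the scores of the selected clusters, and that no selected cluster may be an ancestor of another. The names Node, quality, and extract_flat_clustering are hypothetical. Under these assumptions, one bottom-up pass followed by one top-down pass visits each node a constant number of times.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical cluster-tree node; `quality` is an assumed per-cluster score."""
    name: str
    quality: float
    children: List["Node"] = field(default_factory=list)


def extract_flat_clustering(root: Node) -> Tuple[float, List[str]]:
    """Pick clusters, none an ancestor of another, maximizing total quality.

    Assumption: the objective is additive over selected clusters.
    Returns (best total quality, names of the selected clusters).
    """
    # Pre-order listing: every node appears after its parent.
    order: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(node.children)

    best: Dict[int, float] = {}       # best achievable score within each subtree
    keep_self: Dict[int, bool] = {}   # True if the node itself is the better choice

    # Bottom-up pass: reversed pre-order visits children before parents.
    for node in reversed(order):
        child_total = sum(best[id(c)] for c in node.children)
        # A leaf always keeps itself; ties favor the coarser (parent) cluster.
        keep = not node.children or node.quality >= child_total
        keep_self[id(node)] = keep
        best[id(node)] = node.quality if keep else child_total

    # Top-down pass: stop descending at the first kept node on each branch.
    selected: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if keep_self[id(node)]:
            selected.append(node.name)
        else:
            stack.extend(node.children)
    return best[id(root)], selected


if __name__ == "__main__":
    # Illustrative tree with made-up quality scores.
    tree = Node("root", 3.0, [
        Node("A", 2.0, [Node("A1", 1.5), Node("A2", 1.0)]),
        Node("B", 2.5),
    ])
    score, clusters = extract_flat_clustering(tree)
    print(score, sorted(clusters))
```

Because each branch is cut at its own depth, the result need not correspond to a single horizontal cut through the hierarchy. In a semi-supervised setting the score could, as a further assumption, measure constraint satisfaction rather than an unsupervised quality criterion.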
