F-tree: an algorithm for clustering transactional data using frequency tree

Clustering is an important data mining technique that groups similar data records, and categorical transaction clustering has recently received increasing attention. In this research, we study the problem of clustering categorical transactional data characterized by high dimensionality and large volume. We propose a novel algorithm for clustering transactional data called F-Tree, which is based on the idea of the frequent pattern algorithm FP-tree, one of the fastest approaches to frequent itemset mining. The simple idea behind F-Tree is to generate small, high-purity clusters and then merge them, which makes it fast and dynamic when clustering large, high-dimensional transactional datasets. We also present a new solution to the problem of overlap between clusters by defining a new criterion function based on the probability of overlap between weighted items. Our experimental evaluation on real datasets shows the following. First, F-Tree is effective in finding interesting clusters. Second, the use of the tree structure reduces the clustering time for large datasets with many attributes. Third, the proposed evaluation metric resolves the overlap of transaction items efficiently and yields high-quality clustering results. Finally, we conclude that merging small, pure clusters both increases the purity of the resulting clusters and reduces clustering time, compared with generating clusters directly from the dataset and then refining them.
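
The "generate small pure clusters, then merge" idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tree construction or the criterion function, so every name here (`seed_clusters`, `item_weights`, `overlap`, `merge_clusters`, `prefix_len`, `threshold`) is hypothetical. The sketch assumes that seeds are formed by grouping transactions that share a prefix of items ordered by descending global frequency (as in an FP-tree path), and it substitutes a simple weighted-Jaccard score for the paper's overlap criterion.

```python
from collections import Counter, defaultdict


def seed_clusters(transactions, prefix_len=2):
    """Group transactions that share the same frequency-ordered item prefix.

    Assumption: this stands in for walking a frequency tree down to depth
    `prefix_len`; the paper's actual tree construction is not given here.
    """
    freq = Counter(item for t in transactions for item in set(t))
    groups = defaultdict(list)
    for t in transactions:
        # Order items by descending global frequency, ties broken by name.
        ordered = sorted(set(t), key=lambda i: (-freq[i], i))
        groups[tuple(ordered[:prefix_len])].append(set(t))
    return list(groups.values())


def item_weights(cluster):
    """Weight of an item = fraction of the cluster's transactions containing it."""
    counts = Counter(i for t in cluster for i in t)
    return {i: c / len(cluster) for i, c in counts.items()}


def overlap(a, b):
    """Weighted Jaccard similarity between two clusters (an assumed criterion)."""
    wa, wb = item_weights(a), item_weights(b)
    shared = sum(min(wa[i], wb[i]) for i in wa.keys() & wb.keys())
    total = sum(max(wa.get(i, 0.0), wb.get(i, 0.0)) for i in wa.keys() | wb.keys())
    return shared / total if total else 0.0


def merge_clusters(clusters, threshold=0.5):
    """Greedily merge the most-overlapping pair until none reaches `threshold`."""
    clusters = [list(c) for c in clusters]
    while True:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = overlap(clusters[x], clusters[y])
                if score >= best:
                    best, pair = score, (x, y)
        if pair is None:
            return clusters
        x, y = pair
        clusters[x].extend(clusters.pop(y))


# Toy market-basket data with two obvious groups of transactions.
transactions = [
    ["a", "b", "c"], ["a", "b", "d"], ["a", "b"],
    ["x", "y", "z"], ["x", "y"], ["x", "y", "w"],
]
seeds = seed_clusters(transactions, prefix_len=3)
merged = merge_clusters(seeds, threshold=0.5)
print(len(seeds), "seed clusters ->", len(merged), "merged clusters")
```

In this toy run the deep prefix yields one seed per transaction, and the merge step then joins seeds whose weighted item profiles overlap. The pairwise merge loop is quadratic in the number of seeds per round, which is acceptable for a sketch but is not meant to reflect the running time claimed for F-Tree.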
