Clustering based semantic data summarization technique: A new approach

Due to the advancement of computing and the proliferation of data repositories, efficient data mining techniques are required to extract meaningful information. Summarization is one such data analysis technique, and summarization methods can be broadly classified as either semantic or syntactic. Syntactic methods treat a dataset as a sequence of bytes, whereas semantic methods convert a large dataset into a much smaller one while keeping information loss low. Clustering algorithms, such as basic k-means, are widely used for semantic summarization. Existing clustering-based summarization techniques represent a summary by the cluster centroids. However, a centroid need not coincide with any actual data point, so the summary may not consist of points that occur in the data. In addition, many clustering algorithms, including the popular k-means algorithm, require the number of clusters as an input, which is not available for unsupervised summarization of unlabeled data. To address these issues, we propose a clustering-based semantic summarization technique that combines the x-means and k-medoid clustering algorithms. Our experimental analysis shows that the proposed algorithm outperforms k-means-based summarization techniques.
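To make the idea concrete, the following is a minimal sketch of how such a two-stage summarizer might look. It is not the authors' implementation. All function names (`kmeans`, `bic`, `kmedoids`, `summarize`) and the parameter `k_max` are hypothetical. Two simplifications are assumed: a scan over candidate values of k scored by a spherical-Gaussian BIC stands in for x-means' recursive centroid splitting, and a simple alternating medoid update stands in for a full k-medoid procedure. Farthest-first initialization is used only to keep the sketch deterministic.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples of numbers)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=100):
    """Plain k-means with farthest-first initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            # Keep the old center if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

def bic(points, centers, labels):
    """BIC of a spherical Gaussian mixture with one shared variance.
    Higher is better. Requires len(points) > len(centers)."""
    n, d, k = len(points), len(points[0]), len(centers)
    sse = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
    var = max(sse / (d * (n - k)), 1e-12)  # per-dimension variance estimate
    sizes = [labels.count(j) for j in range(k)]
    loglik = (sum(s * math.log(s / n) for s in sizes if s)
              - 0.5 * n * d * math.log(2 * math.pi * var)
              - 0.5 * d * (n - k))
    n_params = (k - 1) + k * d + 1  # mixing weights, centers, variance
    return loglik - 0.5 * n_params * math.log(n)

def kmedoids(points, medoids, iters=100):
    """Alternating k-medoids: assign points, then move each medoid to the
    member with the smallest total distance to its cluster."""
    for _ in range(iters):
        labels = assign(points, medoids)
        updated = []
        for j in range(len(medoids)):
            members = [p for p, l in zip(points, labels) if l == j]
            updated.append(min(members, key=lambda c: sum(
                math.sqrt(dist2(c, p)) for p in members))
                if members else medoids[j])
        if updated == medoids:
            break
        medoids = updated
    return medoids

def summarize(points, k_max=6):
    """Pick k by BIC, then return k medoids (actual data points) as summary."""
    centers, _ = max((kmeans(points, k) for k in range(1, k_max + 1)),
                     key=lambda result: bic(points, *result))
    # Seed the medoids with the data points closest to the chosen centroids.
    seeds = [min(points, key=lambda p: dist2(p, c)) for c in centers]
    return kmedoids(points, seeds)

# Toy data: three well-separated 3x3 grids of points.
data = [(cx + dx, cy + dy)
        for cx, cy in [(0, 0), (10, 0), (0, 10)]
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
print(summarize(data))
```

Because the summary is built from medoids rather than centroids, every element of the returned list is a record that actually occurs in the input, and the caller supplies only an upper bound `k_max` instead of the exact number of clusters.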
