Concepts Labeling of Document Clusters Using a Hierarchical Agglomerative Clustering (HAC) Technique

The most common way to organize and label documents is to group similar documents into clusters. The number of clusters assumed in advance is often unreliable, because the grouping structure of the data is unknown before processing, and partitioning methods therefore tend to predict that structure poorly. Hierarchical clustering addresses this problem by providing views of the data at different levels of abstraction, which makes it well suited to visualizing the generated concepts and interactively exploring large document collections. Clusters very often include sub-clusters, so a hierarchical structure is a natural constraint of the underlying application domain. The method used to combine two clusters into a single cluster affects the quality of the clusters produced, so various distance methods are studied for clustering documents with hierarchical agglomerative clustering (HAC). To manage and organize documents effectively, similar documents are merged to form clusters, and each document is represented by one or more concepts. In this paper, concepts that characterize English documents are generated using HAC. One advantage of hierarchical clustering is that overlapping clusters can be formed and concepts can be generated from the contents of each cluster. The quality of the clusters produced under different distance measures is also investigated.
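The agglomerative procedure described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes raw term-frequency vectors, cosine distance, and three common linkage rules (single, complete, average), and it labels each cluster with its most frequent terms as a simple stand-in for concept generation. All function names (`tf_vector`, `cosine_distance`, `hac`, `label_cluster`) are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector (assumed representation, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def linkage_distance(c1, c2, dist, method):
    """Distance between two clusters under the chosen linkage rule."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if method == "single":
        return min(pairs)
    if method == "complete":
        return max(pairs)
    if method == "average":
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")


def hac(docs, num_clusters, method="average"):
    """Merge the two closest clusters until num_clusters remain."""
    vectors = [tf_vector(d) for d in docs]
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per document
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], dist, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def label_cluster(cluster, docs, top_k=2):
    """Label a cluster with its top_k most frequent terms (an assumption)."""
    counts = Counter()
    for i in cluster:
        counts.update(tf_vector(docs[i]))
    return [term for term, _ in counts.most_common(top_k)]


docs = [
    "cat dog cat pet",
    "dog cat pet pet",
    "stock market stock price",
    "market price stock trade",
]
clusters = hac(docs, num_clusters=2, method="average")
for c in clusters:
    print(c, label_cluster(c, docs))
```

Changing `method` to `"single"` or `"complete"` changes only how the distance between two clusters is computed, which is the choice whose effect on cluster quality the paper sets out to compare. The sketch stops at a fixed number of clusters and produces non-overlapping clusters; recording each merge instead would yield the full hierarchy.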
