A Hybrid Strategy for Clustering Data Mining Documents

As the number of electronic documents grows, it becomes hard to organize, analyze, and present them efficiently by hand. Document clustering, which automatically groups similar or related documents, has been used in practical applications to understand the contents and structure of document collections. Although a variety of methods and algorithms have been proposed, generating meaningful document clusters remains a challenging task. This paper uses an approach that combines quantitative and qualitative methods to create high-quality clusters for a collection of data mining and knowledge discovery (DMKD) publications. The quantitative method extracts a list of nouns and noun phrases from the DMKD documents and uses an optimization procedure from the CLUTO toolkit to assign documents to clusters. The qualitative method uses grounded theory to identify the major categories of the documents to improve the comprehensibility of the resulting clusters. The results demonstrate that the strategy produces more meaningful clusters than a single-term k-way clustering algorithm in terms of both internal metrics and human assessment.