A Semi-supervised approach to Document Clustering with Sequence Constraints

Document clustering is usually performed as an unsupervised task. It attempts to separate different groups of documents (clusters) from a document collection based on implicitly identifying the common patterns present in these documents. A semi-supervised approach to this problem recently reported promising results. In semi-supervised approach, an explicit background knowledge (for example: Must-link or Cannot-link information for a pair of documents) is used in the form of constraints to drive the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. There are three main contributions through this paper (i) a document is transformed primarily into a graph representation based on Graph-of-Word approach. From this graph, a word sequences of size=3 is extracted. This sequence is used as a feature for the semi-supervised clustering. (ii) A similarity function based on commonword sequences is proposed, and (iii) the constrained based algorithm is designed to perform the actual cluster process through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms the recently proposed algorithms for document clustering in term of standard evaluation measures for document clustering task.

[1]  Gad M. Landau,et al.  Fast Parallel and Serial Approximate String Matching , 1989, J. Algorithms.

[2]  Soon Myoung Chung,et al.  Text document clustering based on frequent word meaning sequences , 2008, Data Knowl. Eng..

[3]  Xiaotie Deng,et al.  Efficient Phrase-Based Document Similarity for Clustering , 2008, IEEE Transactions on Knowledge and Data Engineering.

[4]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[5]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[6]  Mohamed S. Kamel,et al.  Efficient phrase-based document indexing for Web document clustering , 2004, IEEE Transactions on Knowledge and Data Engineering.

[7]  Amit Singhal,et al.  AT&T at TREC-7 , 1998, TREC.

[8]  Qing He,et al.  Effective semi-supervised document clustering via active learning with instance-level constraints , 2011, Knowledge and Information Systems.

[9]  Partha Dasgupta,et al.  Topology of the conceptual network of language. , 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[10]  Pavel Zezula,et al.  Similarity Search - The Metric Space Approach , 2005, Advances in Database Systems.

[11]  Michalis Vazirgiannis,et al.  Graph-of-word and TW-IDF: new approach to ad hoc IR , 2013, CIKM.

[12]  John D. Lafferty,et al.  A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval , 2017, SIGF.

[13]  Otto Jespersen,et al.  The Philosophy of Grammar , 1924 .

[14]  David R. Karger,et al.  Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections , 2017, SIGF.

[15]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[16]  C. J. van Rijsbergen,et al.  Probabilistic models of information retrieval based on measuring the divergence from randomness , 2002, TOIS.

[17]  Christina Lioma,et al.  Graph-based term weighting for information retrieval , 2011, Information Retrieval.

[18]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[19]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[20]  Benjamin C. M. Fung,et al.  Hierarchical Document Clustering using Frequent Itemsets , 2003, SDM.

[21]  M. Eduardo Ares,et al.  Constrained Text Clustering Using Word Trigrams , 2012 .

[22]  Stephen E. Robertson,et al.  GatfordCentre for Interactive Systems ResearchDepartment of Information , 1996 .

[23]  M. Rafi,et al.  A comparison of two suffix tree-based document clustering algorithms , 2010, 2010 International Conference on Information and Emerging Technologies.