Constrained Text Clustering Using Word Trigrams

Constrained Clustering has emerged in recent years as a field that proposes clustering algorithms able to incorporate domain information in order to obtain a better final grouping. This information is usually provided as pairwise constraints, which are costly to acquire from humans. In this paper we propose a novel method based on word n-grams to automatically extract positive constraints from text collections. Clustering experiments on text collections composed of different types of documents show that the constraints created with our method yield statistically significant improvements over both the results obtained with constraints created using named entities and the results of a high-performing non-constrained algorithm.
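
To make the idea of n-gram-derived positive constraints concrete, the following is a minimal sketch, not the procedure evaluated in the paper. It assumes that two documents receive a positive (must-link) constraint when they share at least `min_shared` word trigrams. The whitespace tokenization, the function names `word_trigrams` and `must_link_constraints`, and the `min_shared` threshold are illustrative assumptions that do not come from the abstract.

```python
from collections import defaultdict
from itertools import combinations


def word_trigrams(text):
    """Return the set of word trigrams in a text (naive lowercase/whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def must_link_constraints(docs, min_shared=1):
    """Hypothetical rule: link two documents that share >= min_shared trigrams."""
    # Inverted index: trigram -> ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for trigram in word_trigrams(text):
            index[trigram].add(doc_id)

    # Count how many distinct trigrams each pair of documents has in common.
    shared = defaultdict(int)
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            shared[(a, b)] += 1

    # Keep the pairs that reach the threshold as positive constraints.
    return {pair for pair, count in shared.items() if count >= min_shared}


docs = [
    "the quick brown fox jumps",
    "a quick brown fox sleeps",
    "completely unrelated text here",
]
constraints = must_link_constraints(docs)
```

In this toy collection only the first two documents share a trigram ("quick brown fox"), so they form the single extracted pair. The resulting set of pairs could then be passed to any constrained clustering algorithm that accepts must-link constraints.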
