Paper information - Interactive Clustering of Text Collections According to a User-Specified Criterion


Document clustering is traditionally tackled from the perspective of grouping documents that are topically similar. However, many other criteria for clustering documents can be considered: for example, documents' genre or the author's mood. We propose an interactive scheme for clustering document collections, based on any criterion of the user's preference. The user holds an active position in the clustering process: first, she chooses the types of features suitable to the underlying task, leading to a task-specific document representation. She can then provide examples of features when such examples come readily to mind; e.g., when clustering by the author's sentiment, words like 'perfect', 'mediocre', 'awful' are intuitively good features. The algorithm proceeds iteratively, and the user can fix errors made by the clustering system at the end of each iteration. Such an interactive clustering method demonstrates excellent results on clustering by sentiment, substantially outperforming an SVM trained on a large amount of labeled data. Even if features are not provided because they are not intuitively obvious to the user (e.g., what would be good features for clustering by genre using part-of-speech trigrams?), our multi-modal clustering method performs significantly better than k-means and Latent Dirichlet Allocation (LDA).

James Allan | Hema Raghavan | Koji Eguchi | Ron Bekkerman
