Which Clustering Do You Want? Inducing Your Ideal Clustering with Minimal Feedback

While traditional research on text clustering has largely focused on grouping documents by topic, a user may want to cluster documents along other dimensions, such as the author's mood, gender, age, or sentiment. Without knowing the user's intention, a clustering algorithm will only group documents along the most prominent dimension, which may not be the one the user desires. To address the problem of clustering documents along the user-desired dimension, previous work has focused on learning a similarity metric from data manually annotated with the user's intention, or on having a human construct a feature space interactively during the clustering process. To reduce the reliance on human knowledge that these approaches require for fine-tuning the similarity function or selecting the relevant features, we propose a novel active clustering algorithm that allows a user to easily select the dimension along which she wants to cluster the documents by inspecting only a small number of words. We demonstrate the viability of our algorithm on a variety of commonly used sentiment datasets.
