Supervised Clustering with Structural SVMs

Supervised clustering is the problem of training clustering methods to produce desirable clusterings. Given sets of items and complete clusterings over these sets, a supervised clustering algorithm learns how to cluster future sets of items in a similar fashion, typically by changing the underlying similarity measure between item pairs. This work presents a general approach, based on structural SVMs, for training clustering methods such as correlation clustering and k-means/spectral clustering so that they optimize task-specific performance criteria. We analyze our supervised clustering approach empirically and theoretically on a variety of datasets and clustering methods. This analysis also leads to general insights about structural SVMs beyond supervised clustering. Specifically, since clustering is an NP-hard task, and the corresponding training problem must likewise rely on approximate inference while the parameters are learned, we present a detailed theoretical and empirical analysis of the general use of approximations in structural SVM training.
