Exploiting tag and word correlations for improved webpage clustering

Automatic clustering of webpages helps a number of information retrieval tasks, such as improving user interfaces, collection clustering, and introducing diversity in search results. Typically, webpage clustering algorithms use only features extracted from the page-text. However, the advent of social-bookmarking websites, such as StumbleUpon and Delicious, has led to a huge amount of user-generated content, such as the tag information associated with webpages. In this paper, we present a subspace-based feature extraction approach that leverages tag information to complement the page-contents of a webpage and extract highly discriminative features, with the goal of improved clustering performance. In our approach, we consider page-text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We compare our subspace-based approach with a number of baselines that use tag information in various other ways, and show that the subspace-based approach leads to improved performance on the webpage clustering task. Although our results here are on the webpage clustering task, the same approach can be used for webpage classification as well. Finally, we suggest possible future work on leveraging tag information in webpage clustering, especially when tag information is available for only a small number of webpages.
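To make the two-view idea concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes linear canonical correlation analysis (CCA) as the way to learn the shared subspace and a plain k-means as the clustering step. The function names (`cca_projection`, `kmeans`, `make_two_views`), the regularization constant `reg`, and the synthetic data standing in for page-text and tag features are all illustrative assumptions.

```python
import numpy as np

def cca_projection(X, Y, k, reg=1e-3):
    """Project view X onto its top-k canonical directions with respect to view Y.

    Assumption: regularized linear CCA solved by whitening each view and
    taking an SVD of the whitened cross-covariance.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance T = Lx^{-1} Cxy Ly^{-T}
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, _ = np.linalg.svd(T, full_matrices=False)
    # Map the whitened directions back to the original feature space
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    return Xc @ Wx, s[:k]  # projected data, canonical correlations

def kmeans(Z, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def make_two_views(n_per_cluster=100, seed=0):
    """Synthetic stand-in for (page-text, tag) features sharing one cluster signal."""
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], n_per_cluster)
    z = 2.0 * truth - 1.0  # latent cluster signal, -1 or +1
    text = np.outer(z, rng.normal(size=20)) + 0.5 * rng.normal(size=(truth.size, 20))
    tags = np.outer(z, rng.normal(size=10)) + 0.5 * rng.normal(size=(truth.size, 10))
    return text, tags, truth

text, tags, truth = make_two_views()
Z, corr = cca_projection(text, tags, k=1)  # shared-subspace features
pred = kmeans(Z, k=2)                      # any clustering algorithm could be used here
accuracy = max((pred == truth).mean(), (pred != truth).mean())
print("top canonical correlation:", corr[0], "clustering accuracy:", accuracy)
```

In this sketch the clustering step sees only the projected page-text features `Z`; the tag view is used solely to choose the projection directions. Real page-text and tag matrices are typically high-dimensional and sparse, so a kernelized or more heavily regularized variant may be needed in practice.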
