Leveraging Social Bookmarks from Partially Tagged Corpus for Improved Web Page Clustering

Automatic clustering of Web pages supports a number of information retrieval tasks, such as improving user interfaces, collection clustering, and introducing diversity in search results. Typically, Web page clustering algorithms use only features extracted from the page text. However, the advent of social-bookmarking Web sites, such as StumbleUpon.com and Delicious.com, has led to a large amount of user-generated content, such as the social tag information associated with Web pages. In this article, we present a subspace-based feature extraction approach that leverages social tag information to complement the contents of a Web page and extract better features, with the goal of improved clustering performance. In our approach, we consider page text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We then present an extension that makes our approach applicable even if the Web page corpus is only partially tagged, that is, when social tags are available for only a small number of the Web pages rather than all of them. We compare our subspace-based approach with a number of baselines that use tag information in various other ways, and show that the subspace-based approach leads to improved performance on the Web page clustering task. We also discuss possible future work, including an active learning extension that can help choose which Web pages to obtain tags for when social tags can be obtained for only a small number of Web pages.
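To make the two-view idea concrete, the following is a minimal sketch, not the article's implementation. It assumes plain linear canonical correlation analysis (CCA) as the way to find a subspace that maximizes correlation between the text view and the tag view, and it uses a small hand-written k-means as the "any clustering algorithm" step. For the partially tagged case it uses a simple stand-in that should not be read as the extension described above: the projection is fitted on the tagged pages only, and every page is then embedded through its text view. The data are synthetic, and all names (`fit_cca`, `kmeans`, `n_tagged`, `reg`) are hypothetical.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Regularized linear CCA via whitening followed by an SVD.

    Returns the view means, the projection matrices A (for X) and B (for Y),
    and the top-k canonical correlations.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return mx, my, A, B, s[:k]

def kmeans(Z, k, iters=50, seed=0):
    """Plain Lloyd's k-means; any other clustering algorithm could be used."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Synthetic corpus (an assumption for illustration): both views are noisy
# linear functions of a shared 2-d latent signal with two groups of pages.
rng = np.random.default_rng(0)
n, n_tagged, n_clusters, subspace_dim = 200, 60, 2, 2
group = rng.integers(0, n_clusters, size=n)
latent = np.where(group[:, None] == 0, -2.0, 2.0) + 0.3 * rng.normal(size=(n, 2))
X_text = latent @ rng.normal(size=(2, 20)) + 0.5 * rng.normal(size=(n, 20))
Y_tags = latent @ rng.normal(size=(2, 10)) + 0.5 * rng.normal(size=(n, 10))

# Only the first n_tagged pages are treated as having tags.
mx, my, A, B, corrs = fit_cca(X_text[:n_tagged], Y_tags[:n_tagged], subspace_dim)

# Embed every page, tagged or not, through its text view, then cluster.
Z = (X_text - mx) @ A
labels = kmeans(Z, n_clusters)
```

Here `corrs` holds the canonical correlations found on the tagged subset, and `Z` is the embedding in which clustering is performed. The `reg` term keeps the covariance matrices invertible when there are fewer tagged pages than features, which is likely with sparse text and tag vocabularies.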
