On clustering heterogeneous social media objects with outlier links

The clustering of social media objects provides intrinsic understanding of the similarity relationships between documents, images, and their contextual sources. Both content and link structure provide important cues for an effective clustering algorithm of the underlying objects. While link information provides useful hints for improving the clustering process, it also contains a significant amount of noisy information. Therefore, a robust clustering algorithm is required to reduce the impact of noisy links. In order to address the aforementioned problems, we propose heterogeneous random fields to model the structure and content of social media networks. We design a probability measure on the social media networks which output a configuration of clusters that are consistent with both content and link structure. Furthermore, noisy links can also be detected, and their impact on the clustering algorithm can be significantly reduced. We conduct experiments on a real social media network and show the advantage of the method over other state-of-the-art algorithms.

[1]  Deepayan Chakrabarti,et al.  Evolutionary clustering , 2006, KDD '06.

[2]  Yihong Gong,et al.  Combining content and link for classification using matrix factorization , 2007, SIGIR.

[3]  Wei-Ying Ma,et al.  Hierarchical clustering of WWW image search results using visual, textual and link information , 2004, MULTIMEDIA '04.

[4]  Tao Qin,et al.  Web image clustering by consistent utilization of visual features and surrounding texts , 2005, MULTIMEDIA '05.

[5]  Yun Chi,et al.  Combining link and content for community detection: a discriminative approach , 2009, KDD.

[6]  Jing Hua,et al.  Graph theoretical framework for simultaneously integrating visual and textual features for efficient web image clustering , 2008, WWW.

[7]  Guoping Qiu,et al.  Image and feature co-clustering , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[8]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[9]  Hong Cheng,et al.  Graph Clustering Based on Structural/Attribute Similarities , 2009, Proc. VLDB Endow..

[10]  Guoping Qiu Image and feature co-clustering , 2004, ICPR 2004.

[11]  Jimeng Sun,et al.  Extracting community structure through relational hypergraphs , 2009, WWW '09.

[12]  Ravi Kumar,et al.  Discovering Large Dense Subgraphs in Massive Graphs , 2005, VLDB.

[13]  Yixin Chen,et al.  CLUE: cluster-based retrieval of images by unsupervised learning , 2005, IEEE Transactions on Image Processing.

[14]  Srinivasan Parthasarathy,et al.  Scalable graph clustering using stochastic flows: applications to community discovery , 2009, KDD.

[15]  Jan O. Pedersen,et al.  Almost-constant-time clustering of arbitrary corpus subsets4 , 1997, SIGIR '97.

[16]  Adrian F. M. Smith,et al.  Sampling-Based Approaches to Calculating Marginal Densities , 1990 .

[17]  David R. Karger,et al.  Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[18]  Yun Chi,et al.  Evolutionary spectral clustering by incorporating temporal smoothness , 2007, KDD '07.

[19]  Shiri Gordon,et al.  Unsupervised image-set clustering using an information theoretic framework , 2006, IEEE Transactions on Image Processing.

[20]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[21]  David A. Forsyth,et al.  Clustering art , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[22]  M. Newman,et al.  Finding community structure in very large networks. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[23]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[24]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.