Discovering Large Dense Subgraphs in Massive Graphs

We present a new algorithm for finding large, dense subgraphs in massive graphs. Our algorithm is based on a recursive application of fingerprinting via shingles, and is extremely efficient, capable of handling graphs with tens of billions of edges on a single machine with modest resources.We apply our algorithm to characterize the large, dense subgraphs of a graph showing connections between hosts on the World Wide Web; this graph contains over 50M hosts and 11B edges, gathered from 2.1B web pages. We measure the distribution of these dense subgraphs and their evolution over time. We show that more than half of these hosts participate in some dense subgraph found by the analysis. There are several hundred giant dense subgraphs of at least ten thousand hosts; two thousand dense subgraphs at least a thousand hosts; and almost 64K dense subgraphs of at least a hundred hosts.Upon examination, many of the dense subgraphs output by our algorithm are link spam, i.e., websites that attempt to manipulate search engine rankings through aggressive interlinking to simulate popular content. We therefore propose dense subgraph extraction as a useful primitive for spam detection, and discuss its incorporation into the workflow of web search engines.

[1]  Uriel Feige,et al.  The Dense k -Subgraph Problem , 2001, Algorithmica.

[2]  Andrei Z. Broder,et al.  A Comparison of Techniques to Find Mirrored Hosts on the WWW , 2000, IEEE Data Eng. Bull..

[3]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[4]  Andrei Z. Broder,et al.  Graph structure in the Web , 2000, Comput. Networks.

[5]  Krishna Bharat,et al.  Who links to whom: mining linkage between Web sites , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[6]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[7]  Mayur Datar,et al.  On the streaming model augmented with a sorting primitive , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[8]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..

[9]  Ravi Kumar,et al.  Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[10]  Sandra Sudarsky,et al.  Massive Quasi-Clique Detection , 2002, LATIN.

[11]  C. Lee Giles,et al.  Efficient identification of Web communities , 2000, KDD '00.

[12]  Andrew Tomkins,et al.  How to build a WebFountain: An architecture for very large-scale text analytics , 2004, IBM Syst. J..

[13]  Ravi Kumar,et al.  On the Bursty Evolution of Blogspace , 2003, WWW '03.

[14]  Ravi Kumar,et al.  A graph-theoretic approach to extract storylines from search results , 2004, KDD.

[15]  David Gibson Surfing the web by site , 2004, WWW Alt. '04.

[16]  Prabhakar Raghavan,et al.  Computing on data streams , 1999, External Memory Algorithms.

[17]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[18]  Ravi Kumar,et al.  Extracting Large-Scale Knowledge Bases from the Web , 1999, VLDB.

[19]  Rajeev Motwani,et al.  Stratified Planning , 2009, IJCAI.

[20]  Chaomei Chen,et al.  Mining the Web: Discovering knowledge from hypertext data , 2004, J. Assoc. Inf. Sci. Technol..

[21]  Ben Shneiderman,et al.  Identifying aggregates in hypertext structures , 1991, HYPERTEXT '91.

[22]  Ramanathan V. Guha,et al.  SemTag and seeker: bootstrapping the semantic web via automated semantic annotation , 2003, WWW '03.

[23]  Ramana Rao,et al.  Silk from a sow's ear: extracting usable structures from the Web , 1996, CHI.

[24]  Marc Najork,et al.  Spam, damn spam, and statistics: using statistical analysis to locate spam web pages , 2004, WebDB '04.

[25]  David Gibson The site browser: catalyzing improvements in hypertext organization , 2004, HYPERTEXT '04.