Compact representation of Web graphs with extended functionality

The representation of large subsets of the World Wide Web as directed graphs has been used extensively to analyze the structure, behavior, and evolution of these so-called Web graphs. However, interesting Web graphs are very large, and their classical representations do not fit into the main memory of typical computers, while the required graph algorithms perform poorly on secondary memory. Compressed graph representations drastically reduce space requirements while still allowing efficient navigation in compressed form. The most basic navigation operation is retrieving the successors of a node, but several important Web graph algorithms require extended queries, such as finding the predecessors of a node, checking for the presence of a link, or retrieving the links between ranges of nodes. These are seldom supported by compressed graph representations. This paper presents the k^2-tree, a novel Web graph representation based on a compact tree structure that takes advantage of large empty areas of the adjacency matrix of the graph. The representation not only retrieves successors and predecessors symmetrically, but is also particularly efficient at checking for specific links between nodes or between ranges of nodes, and at listing the links between ranges. Compared to the best representations in the literature supporting successor and predecessor queries, our technique offers the least space usage (1-3 bits per link) while supporting fast navigation to predecessors and successors and sharply outperforming the others on the extended queries. The representation is also of general interest and can be used to compress other kinds of graphs and data structures.
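To make the idea concrete, the following is a minimal, uncompressed sketch of a k^2-tree-like structure in Python. It is an illustration under stated assumptions, not the paper's implementation: the class name `K2Tree`, its method names, and the use of plain Python lists with a precomputed prefix-sum array in place of succinct rank-enabled bitmaps are all hypothetical choices made here. The adjacency matrix is padded to a power of k and split recursively into k^2 submatrices; each submatrix contributes one bit (1 if it contains any link), and all-zero submatrices are not expanded further, which is where large empty areas save space.

```python
from itertools import accumulate


class K2Tree:
    """Illustrative k^2-tree over an n x n adjacency matrix (not space-optimized)."""

    def __init__(self, edges, n, k=2):
        self.k = k
        self.size = k
        while self.size < n:          # pad the matrix side to a power of k
            self.size *= k
        levels = []
        frontier = [(0, 0, list(set(edges)))]   # (row0, col0, links inside)
        side = self.size
        while side > 1:
            side //= k
            bits, nxt = [], []
            for r0, c0, cell in frontier:
                buckets = [[] for _ in range(k * k)]
                for r, c in cell:
                    buckets[((r - r0) // side) * k + (c - c0) // side].append((r, c))
                for i, b in enumerate(buckets):
                    bits.append(1 if b else 0)
                    if b:             # only non-empty submatrices are expanded
                        nxt.append((r0 + (i // k) * side, c0 + (i % k) * side, b))
            levels.append(bits)
            frontier = nxt
        # T holds all levels but the last, L holds the last level (matrix cells).
        self.T = [b for lvl in levels[:-1] for b in lvl]
        self.L = levels[-1]
        # rank[i] = number of 1s in T[:i]; a stand-in for a succinct rank structure.
        self.rank = [0] + list(accumulate(self.T))

    def _bit(self, p):
        return self.T[p] if p < len(self.T) else self.L[p - len(self.T)]

    def has_link(self, r, c):
        """Check a single cell by following one root-to-leaf path."""
        side, base = self.size, 0
        while True:
            side //= self.k
            p = base + (r // side) * self.k + (c // side)
            if not self._bit(p):
                return False
            if side == 1:
                return True
            r, c = r % side, c % side
            base = self.rank[p + 1] * self.k ** 2   # first child of node p

    def _neighbors(self, q, by_row):
        out = []
        stack = [(self.size, 0, q, 0)]  # (parent side, children base, local coord, offset)
        while stack:
            side, base, loc, off = stack.pop()
            side //= self.k
            for j in range(self.k):
                if by_row:
                    p = base + (loc // side) * self.k + j
                else:
                    p = base + j * self.k + loc // side
                if self._bit(p):
                    if side == 1:
                        out.append(off + j)
                    else:
                        stack.append((side, self.rank[p + 1] * self.k ** 2,
                                      loc % side, off + j * side))
        return sorted(out)

    def successors(self, r):
        return self._neighbors(r, by_row=True)

    def predecessors(self, c):
        return self._neighbors(c, by_row=False)


# Small example graph on 5 nodes (made-up data for illustration only).
edges = {(0, 1), (0, 3), (2, 1), (4, 0), (4, 4)}
tree = K2Tree(edges, n=5)
print(tree.successors(0), tree.predecessors(1), tree.has_link(4, 4))
```

Successor and predecessor retrieval share the same traversal and differ only in whether the row or the column is held fixed, which reflects the symmetry the abstract describes. Range queries are omitted from this sketch.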
