The structure of broad topics on the web

The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not in terms of page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web and analyze the capability of recent random-walk algorithms to draw samples that follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to a page about another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
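
The two link-based measurements described above can be illustrated on a toy graph. The sketch below is not the paper's procedure: it assumes every page already carries a single topic label (in place of an automatic classifier), uses a plain forward walk over out-links, and lets dead-end pages keep their probability mass. The function names `topic_link_matrix` and `topic_mix_after` and the example data are hypothetical.

```python
from collections import defaultdict


def topic_link_matrix(links, topic_of):
    """Estimate P(target topic | source topic) from (src, dst) link pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in links:
        counts[topic_of[src]][topic_of[dst]] += 1
    matrix = {}
    for src_topic, row in counts.items():
        total = sum(row.values())
        matrix[src_topic] = {t: c / total for t, c in row.items()}
    return matrix


def topic_mix_after(links, topic_of, start_topic, steps):
    """Topic distribution after `steps` uniform out-link hops.

    The walk starts uniformly at random on a page labeled `start_topic`.
    Assumption: a page with no out-links keeps its mass (the walk stays put).
    """
    out = defaultdict(list)
    for src, dst in links:
        out[src].append(dst)
    starts = [p for p, t in topic_of.items() if t == start_topic]
    mass = {p: 1.0 / len(starts) for p in starts}
    for _ in range(steps):
        nxt = defaultdict(float)
        for page, m in mass.items():
            targets = out.get(page) or [page]
            for dst in targets:
                nxt[dst] += m / len(targets)
        mass = nxt
    dist = defaultdict(float)
    for page, m in mass.items():
        dist[topic_of[page]] += m
    return dict(dist)


# Hypothetical four-page graph with two broad topics.
topic_of = {"a1": "arts", "a2": "arts", "s1": "sports", "s2": "sports"}
links = [("a1", "a2"), ("a2", "a1"), ("a1", "s1"),
         ("s1", "s2"), ("s2", "s1"), ("s2", "a1")]

print(topic_link_matrix(links, topic_of))
for k in range(4):
    print(k, topic_mix_after(links, topic_of, "arts", k))
```

Reading the rows of the matrix gives the chance that a link leaving one topic lands in another. Watching the share of the starting topic shrink as the number of hops grows gives a rough picture of how quickly topic context is lost along a walk.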
