Characterizing a national community web

This article presents a characterization of the community Web of the people of Portugal. We defined criteria for delimiting this Web based on our past experience of crawling pages related to Portugal and collected over 3.2 million documents from 46,000 sites satisfying those criteria. Our characterization was derived from this crawl. We describe the rules that we established for defining the boundaries of this community Web and the methodology used to gather statistics. Statistics cover the number and domain distribution of sites; the number, type and size distribution of text documents; and the linkage structure of this Web. We also show how crawling constraints and abnormal situations on the Web can influence the statistics.
