DynWebStats: A Framework for Determining Dynamic and Up-to-date Web Indicators

The growth and popularity of the Internet and, more specifically, of the World Wide Web and its services and applications have been broadly discussed in recent years. Although this growth is common knowledge, acquiring indicators about it and about the characteristics of the whole Web, or even parts of it, is a big challenge, which can be explained by three factors: (1) the constant and dynamic evolution of the Web in many dimensions, so that any analysis becomes obsolete as soon as it is ready; (2) the large volume of data needed to generate indicators, which is usually disruptive in terms of bandwidth and storage and also raises problems related to the ethics and network viability of the crawl; and (3) the coverage and freshness required to generate indicators, whether about domains or about Web pages. This paper presents a new methodology for generating dynamic Web indicators that takes Web page changes into account, both modifications to existing pages and the creation or deletion of pages. The methodology enables rational crawling and offers a measure of the quality of the indicators. To validate it, we ran a simulation using a dataset of 8,690 Web pages that were downloaded daily for 134 days. The results show that it is possible to crawl a larger universe of Web pages and still keep the indicators within acceptable levels of confidence, making it possible to obtain a snapshot of this universe that is as close to reality as possible.
