Prediction of new outlinks for focused Web crawling

Discovering new hyperlinks enables Web crawlers to find new pages that have not yet been indexed. This is especially important for focused crawlers, which strive to provide a comprehensive analysis of specific parts of the Web and therefore prioritize discovering new pages over discovering changes in content. In the literature, changes in hyperlinks and changes in content have usually been considered together. However, there is also evidence suggesting that these two types of changes are not necessarily related. Moreover, many studies on predicting changes assume that a long history of a page is available, which is unattainable in practice. The aim of this work is to provide a methodology for detecting new hyperlinks effectively using a short history. To this end, we use a dataset of ten crawls taken at one-week intervals. Our study consists of three parts. First, we gain insight into the data by analyzing empirical properties of the number of new outlinks. We observe that these properties are, on average, stable over time, but that there is a large difference between the emergence of hyperlinks towards pages within the domain of a target page (internal outlinks) and towards pages outside it (external outlinks). Next, we provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links. These models include features used earlier in the literature as well as new features introduced in this work. We analyze the correlation between the features and investigate their informativeness. A notable finding is that, if the history of the target page is not available, then our new features, which represent the history of related pages, are the most predictive of new hyperlinks in the target page. Finally, we propose ranking methods as guidelines for focused crawlers to discover new pages efficiently, and demonstrate that they achieve excellent performance with respect to the corresponding targets.
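As a rough illustration of the three targets named above, the following Python sketch derives them from a short sequence of crawls of a single page. The abstract does not give formal definitions, so everything here is an assumption made for illustration: a link counts as "new" if it did not appear in any earlier crawl, the link change rate is taken as the fraction of crawl intervals in which at least one new outlink appeared, and the function names `new_outlinks_per_crawl` and `link_targets` are hypothetical.

```python
from typing import List, Set, Tuple


def new_outlinks_per_crawl(snapshots: List[Set[str]]) -> List[int]:
    """Count, for each crawl after the first, the outlinks not seen in any earlier crawl."""
    if len(snapshots) < 2:
        raise ValueError("need at least two crawls to detect new outlinks")
    seen = set(snapshots[0])
    counts = []
    for links in snapshots[1:]:
        counts.append(len(links - seen))  # links that appear for the first time
        seen |= links
    return counts


def link_targets(snapshots: List[Set[str]]) -> Tuple[float, bool, int]:
    """Return (change rate, presence of new links, number of new links).

    Assumed definitions (not taken from the paper):
    - change rate: fraction of crawl intervals with at least one new outlink
    - presence / number: refer to the most recent crawl interval
    """
    counts = new_outlinks_per_crawl(snapshots)
    change_rate = sum(1 for c in counts if c > 0) / len(counts)
    latest = counts[-1]
    return change_rate, latest > 0, latest


# Four weekly crawls of one hypothetical page, each given as its set of outlinks.
crawls = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"a", "c", "d", "e"}]
rate, present, number = link_targets(crawls)
```

Splitting each snapshot into internal and external outlinks before calling these functions would give separate targets for the two link types that the abstract distinguishes.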
