Change Detection and Notification of Web Pages

Most webpages today are dynamic and change frequently: new content is added, and existing content is updated or deleted. People therefore find it useful to be alerted to changes in webpages that contain information of value to them. Keeping track of these webpages and receiving alerts about their changes, however, has become significantly challenging. Change Detection and Notification (CDN) systems were introduced to automate this monitoring process and to notify users when changes occur in webpages. This survey classifies and analyzes the different aspects of CDN systems and the techniques used for each aspect. It also highlights the current challenges and areas for improvement in this field of research.
