Fault Tolerant P2P RIA Crawling

Rich Internet Applications (RIAs) have been widely used on the web over the last decade because they are more responsive and user-friendly than traditional web applications. Distributed RIA crawling was introduced to reduce crawling time, which is long because RIAs are large. However, current RIA crawling systems do not tolerate failures in their components. In this paper, we address the resilience problem of crawling RIAs in a distributed environment and introduce an efficient, fault-tolerant RIA crawling system. Our approach partitions the RIA model that results from the crawl over several storage devices in a peer-to-peer (P2P) network, so that the distributed data structure has no single point of failure. We introduce three data recovery mechanisms for crawling RIAs in an unreliable environment: the Retry, the Redundancy, and the Combined mechanisms. We evaluate the performance of the recovery mechanisms and their impact on crawling performance through analytical reasoning.
