Proxisch: An Optimization Approach of Large-Scale Unstable Proxy Servers Scheduling

Large companies such as Google and Microsoft, which have ample proxy servers, can run web crawlers against a single website in parallel. Researchers, who often cannot afford commercial proxy servers, still find it difficult to crawl large amounts of information from a single website in parallel. In this case, free public proxy servers collected from the Internet are a good alternative. To make such a crawler efficient, two issues must be addressed: (1) tasks may fail owing to the instability of free proxy servers; (2) a proxy server will be blocked if it visits a single website too frequently. In this paper, we propose Proxisch, an optimization approach to scheduling large-scale unstable proxy servers, which allows anyone to run a web crawler efficiently at extremely low cost. To address the first issue, Proxisch makes maximum use of reliable proxy servers. To address the second, it establishes a frequency control mechanism that keeps the visiting frequency of any chosen proxy server below the website's limit. The results show that Proxisch performs better than the other scheduling algorithms it is compared with.

Keywords: proxy server, priority queue, optimization approach, distributed web crawling.
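
The abstract names two ingredients, a preference for reliable proxy servers and a per-proxy frequency limit, but does not give the algorithm itself. The following is only a minimal sketch of how the two could be combined with priority queues; it is not the Proxisch algorithm. The class name `ProxyScheduler`, the parameter `min_interval`, the smoothed success-rate score, and the two-heap layout are all assumptions made for illustration.

```python
import heapq


class ProxyScheduler:
    """Hypothetical scheduler: prefers reliable proxies, enforces a cool-down."""

    def __init__(self, proxies, min_interval):
        # min_interval: assumed minimum number of seconds between two uses
        # of the same proxy, chosen to stay under the website's limit.
        self.min_interval = min_interval
        # [successes, attempts] per proxy, smoothed so every new proxy
        # starts with a reliability score of 0.5.
        self.stats = {p: [1, 2] for p in proxies}
        # Ready proxies, ordered by descending reliability (min-heap on -score).
        self.ready = [(-0.5, p) for p in proxies]
        heapq.heapify(self.ready)
        # Proxies that are cooling down, ordered by the time they become usable.
        self.cooling = []

    def reliability(self, proxy):
        successes, attempts = self.stats[proxy]
        return successes / attempts

    def acquire(self, now):
        """Return the most reliable usable proxy, or None if all are cooling down."""
        while self.cooling and self.cooling[0][0] <= now:
            _, proxy = heapq.heappop(self.cooling)
            heapq.heappush(self.ready, (-self.reliability(proxy), proxy))
        if not self.ready:
            return None
        return heapq.heappop(self.ready)[1]

    def release(self, proxy, succeeded, now):
        """Record the task outcome and start the proxy's cool-down period."""
        record = self.stats[proxy]
        record[0] += 1 if succeeded else 0
        record[1] += 1
        heapq.heappush(self.cooling, (now + self.min_interval, proxy))


# Usage with explicit timestamps (in seconds) so the behaviour is easy to follow.
scheduler = ProxyScheduler(["a", "b"], min_interval=10)
first = scheduler.acquire(now=0)
scheduler.release(first, succeeded=False, now=0)
second = scheduler.acquire(now=1)
scheduler.release(second, succeeded=True, now=1)
```

A proxy is held in neither heap while a task is using it, so no stale heap entries need to be tracked. Measuring the cool-down from the release time is a conservative choice: it guarantees at least `min_interval` seconds between two requests through the same proxy.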
