An Improved Bloom Filter in Distributed Crawler

Distributed crawlers deliver great value to both business and scientific research by collecting online data resources, but large numbers of duplicate URLs seriously reduce their efficiency. A Bloom filter represents a set as an array of bits and uses hash functions to test whether an element is a member, which makes queries efficient while keeping space usage low. However, false positives are an inherent problem of the Bloom filter. In this paper, the MD5 algorithm is used to preprocess each URL, and an improved multi-dimensional Bloom filter algorithm is proposed, which effectively reduces the false positive rate and improves the efficiency of the distributed crawler.
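
To make the mechanism concrete, the following is a minimal sketch of a standard single-array Bloom filter for URL deduplication with MD5 preprocessing. It is not the multi-dimensional algorithm proposed in the paper, whose details are not given in this abstract. The class name `BloomFilter`, the helper `seen_before`, the parameters `num_bits` and `num_hashes`, and the double-hashing scheme used to derive bit positions from the MD5 digest are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter sketch (hypothetical; not the paper's variant)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Preprocess the URL with MD5 to obtain a fixed-length 128-bit digest.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        # Assumption: split the digest into two 64-bit halves and combine
        # them (double hashing) to derive num_hashes bit positions.
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force an odd, nonzero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


def seen_before(bloom: BloomFilter, url: str) -> bool:
    """Report whether the URL was (probably) seen, then record it."""
    already = url in bloom
    if not already:
        bloom.add(url)
    return already
```

A crawler node would call `seen_before` on each extracted link and skip those for which it returns `True`. A false positive therefore causes an unvisited URL to be skipped, which is why lowering the false positive rate matters for crawl coverage.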
