GPU accelerated information retrieval using Bloom filters

Information retrieval is a technique used in search engines, advertisement placement and cognitive databases. With increasing amounts of data and stringent response-time requirements, improving the underlying implementation of document retrieval becomes critical. To this end, we consider the Bloom filter, a simple randomized data structure that answers membership queries with no false negatives and a tunable false-positive probability. We focus on speeding up the algorithm with a Graphics Processing Unit (GPU) based implementation. Starting from a regular CPU implementation of the Bloom filter algorithm, we apply different optimization techniques to the two basic Bloom filter operations: mapping and querying. A substantial speed-up is achieved for both operations: over 300x for mapping, and over 20x for querying. Furthermore, we show that the number of hash functions used during the mapping operation, the number of files, and the number of query words have a significant effect on the execution time and the speed-up.
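
To make the two operations concrete, the following is a minimal CPU-side sketch of mapping and querying in Python. It is not the implementation evaluated here: the class name `BloomFilter`, the helper `candidate_files`, the one-filter-per-file layout, and the use of SHA-256 with double hashing to derive the bit positions are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List


class BloomFilter:
    """Illustrative Bloom filter: a bit array addressed by k hash positions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, word: str) -> Iterator[int]:
        # Assumption: derive k positions from one SHA-256 digest by double
        # hashing (h1 + i * h2); the source does not name its hash functions.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def map(self, word: str) -> None:
        """Mapping: set the k bits selected by the word's hashes."""
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def query(self, word: str) -> bool:
        """Querying: True if all k bits are set (false positives possible)."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


def candidate_files(
    filters: Dict[str, BloomFilter], query_words: Iterable[str]
) -> List[str]:
    """Return the files whose filter reports every query word as present."""
    words = list(query_words)
    return [
        name for name, bf in filters.items()
        if all(bf.query(w) for w in words)
    ]
```

In this sketch, each word costs `num_hashes` bit operations in both `map` and `query`, and each query is repeated per file and per query word, which is consistent with the abstract's observation that these three quantities affect execution time. The work for different words and files is independent, which is what makes the operations candidates for GPU parallelization.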
