A Fast Indexing Algorithm Optimization with User Behavior Pattern

Internet users' access patterns for objects have been observed to follow Zipf's law. This skewed preference for network resources strongly influences real-time lookup performance in large-scale distributed systems. To guarantee search response rates with limited memory space, we develop a new object indexing and locating algorithm called Bloom filter Arrays based on Zipf's-distributed user Preference (ZPBA). The algorithm uses a compact data structure to achieve high accuracy in item lookup. We give a theoretical analysis of ZPBA and then conduct experiments with a corpus of one million items and 100,000 queries to validate our design. The comparison shows that our solution can be 77% more space efficient than traditional Bloom filter-based indexing approaches for applications with concentrated user access preferences. The algorithm demonstrates practical application potential in fault-tolerant, large-scale distributed indexing and item lookup.
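The two ingredients named above, a Bloom filter for membership lookup and a Zipf-distributed access pattern, can be illustrated with a minimal Python sketch. This is not the ZPBA construction itself, whose details are not given in this abstract; the class `BloomFilter`, the helpers `zipf_weights` and `top_share`, the double-hashing scheme, and the Zipf exponent `s = 1.0` are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing
        # (an illustrative choice, not taken from the paper).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def zipf_weights(n, s=1.0):
    """Unnormalized Zipf popularity: the item of rank r has weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def top_share(weights, fraction):
    """Fraction of all accesses that go to the most popular `fraction` of items."""
    k = max(1, int(len(weights) * fraction))
    return sum(weights[:k]) / sum(weights)


if __name__ == "__main__":
    items = [f"item-{i}" for i in range(1000)]
    bf = BloomFilter(num_bits=10_000, num_hashes=7)
    for it in items:
        bf.add(it)
    print("all inserted items found:", all(it in bf for it in items))
    print("share of accesses to top 10% of items:",
          round(top_share(zipf_weights(1000), 0.10), 3))
```

Under a Zipf-like distribution a small set of popular items receives most of the queries, which is the kind of concentration a preference-aware index could exploit when deciding how to spend limited memory.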
