Probabilistic Threshold Join over Distributed Uncertain Data

Many emerging applications collect large amounts of uncertain data from multiple sources in a distributed manner. Previous work on querying uncertain data in distributed environments has focused only on ranking and skyline queries; join queries have not been addressed, despite their importance in databases. In this paper, we address the distributed probabilistic threshold join query, which retrieves from distributed sites the results that satisfy the join condition and whose combined probabilities meet the threshold requirement. We propose a new kind of Bloom filter, the Probability Bloom Filter (PBF), to represent a set with a probabilistic attribute, and we design a PBF-based Bloomjoin algorithm that executes distributed probabilistic threshold join queries with low communication cost. Furthermore, we provide a theoretical analysis of the network cost of our algorithm and validate it by simulation. The experimental results show that, in most scenarios, our algorithm reduces network cost compared with the original Bloomjoin algorithm.
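
The abstract does not specify how a PBF is laid out, so the following is only an illustrative sketch under stated assumptions, not the paper's construction. It assumes that each filter cell stores the largest tuple probability hashed into it (instead of a single bit), that tuples are independent so a joined pair's combined probability is the product of the two tuple probabilities, and that the join is an equi-join on a single key. The names `ProbabilityBloomFilter`, `upper_bound`, and `pbf_bloomjoin` are hypothetical.

```python
import hashlib


class ProbabilityBloomFilter:
    """Hypothetical PBF: each cell keeps the largest probability hashed into it."""

    def __init__(self, num_cells, num_hashes):
        self.num_cells = num_cells
        self.num_hashes = num_hashes
        self.cells = [0.0] * num_cells

    def _positions(self, key):
        # Derive num_hashes cell positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_cells

    def add(self, key, prob):
        for pos in self._positions(key):
            self.cells[pos] = max(self.cells[pos], prob)

    def upper_bound(self, key):
        # Never below the true maximum probability stored for `key`;
        # hash collisions can only make the bound looser (false positives).
        return min(self.cells[pos] for pos in self._positions(key))


def pbf_bloomjoin(site_r, site_s, threshold, num_cells=1024, num_hashes=3):
    """Join two lists of (key, payload, probability) tuples held at two sites.

    Returns the qualifying joined tuples and the number of S tuples shipped.
    Assumption: combined probability = product of the two tuple probabilities.
    """
    # Site R builds the filter and sends it to site S.
    pbf = ProbabilityBloomFilter(num_cells, num_hashes)
    for key, _, prob in site_r:
        pbf.add(key, prob)

    # Site S ships only tuples whose best-case combined probability
    # could still reach the threshold.
    shipped = [t for t in site_s if pbf.upper_bound(t[0]) * t[2] >= threshold]

    # Site R computes the exact join on the shipped tuples.
    results = []
    for r_key, r_val, r_prob in site_r:
        for s_key, s_val, s_prob in shipped:
            if r_key == s_key and r_prob * s_prob >= threshold:
                results.append((r_key, r_val, s_val, r_prob * s_prob))
    return results, len(shipped)
```

Under these assumptions the filter never discards a qualifying pair, because the stored bound is at least the true probability of every inserted tuple. Compared with a plain bit-array Bloom filter, the probability bound additionally lets site S withhold tuples whose key matches but whose combined probability cannot reach the threshold.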
