Ranking distributed probabilistic data

Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.

[1]  Pushpraj Shukla,et al.  Efficient Constraint Monitoring Using Adaptive Thresholds , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[2]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[3]  Kamesh Munagala,et al.  A Sampling-Based Approach to Optimizing Top-k Queries in Sensor Networks , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[4]  Moni Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS '01.

[5]  Zhe Wang,et al.  Efficient top-K query calculation in distributed networks , 2004, PODC '04.

[6]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data and Expected Ranks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[7]  Ambuj K. Singh,et al.  Top-k Spatial Joins of Probabilistic Objects , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[8]  Lise Getoor,et al.  Exploiting shared correlations in probabilistic databases , 2008, Proc. VLDB Endow..

[9]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[10]  Dan Olteanu,et al.  Conditioning probabilistic databases , 2008, Proc. VLDB Endow..

[11]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[12]  Xi Zhang,et al.  On the semantics and evaluation of top-k queries in probabilistic databases , 2008, ICDE Workshops.

[13]  Xiang Lian,et al.  Probabilistic ranked queries in uncertain databases , 2008, EDBT '08.

[14]  Kevin Chen-Chuan Chang,et al.  Probabilistic top-k and ranking-aggregate queries , 2008, TODS.

[15]  Assaf Schuster,et al.  A Geometric Approach to Monitoring Threshold Functions over Distributed Data Streams , 2010, Ubiquitous Knowledge Discovery.

[16]  Christopher Olston,et al.  Finding (recently) frequent items in distributed data streams , 2005, 21st International Conference on Data Engineering (ICDE'05).

[17]  Jian Pei,et al.  Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[18]  Christopher Olston,et al.  Distributed top-k monitoring , 2003, SIGMOD '03.

[19]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[20]  Peter J. Haas,et al.  MCDB: a monte carlo approach to managing uncertain data , 2008, SIGMOD Conference.

[21]  Prithviraj Sen,et al.  Representing and Querying Correlated Tuples in Probabilistic Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[22]  Christopher Ré,et al.  MYSTIQ: a system for finding more answers by using probabilities , 2005, SIGMOD '05.

[23]  Graham Cormode,et al.  Holistic aggregates in a networked world: distributed tracking of approximate quantiles , 2005, SIGMOD '05.

[24]  Amol Deshpande,et al.  Online Filtering, Smoothing and Probabilistic Modeling of Streaming data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[25]  Dimitrios Gunopulos,et al.  The threshold join algorithm for top-k queries in distributed sensor networks , 2005, DMSN '05.

[26]  Luis Gravano,et al.  Optimizing top-k selection queries over multimedia repositories , 2004, IEEE Transactions on Knowledge and Data Engineering.

[27]  Ion Stoica,et al.  Sharing aggregate computation for distributed queries , 2007, SIGMOD '07.

[28]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[29]  Dan Olteanu,et al.  From complete to incomplete information and back , 2007, SIGMOD '07.

[30]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[31]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[32]  Jimeng Sun,et al.  Streaming Pattern Discovery in Multiple Time-Series , 2005, VLDB.

[33]  Parag Agrawal,et al.  Trio: a system for data, uncertainty, and lineage , 2006, VLDB.

[34]  Dan Olteanu,et al.  Fast and Simple Relational Processing of Uncertain Data , 2007, 2008 IEEE 24th International Conference on Data Engineering.

[35]  Jianliang Xu,et al.  Top-k Monitoring in Wireless Sensor Networks , 2007, IEEE Transactions on Knowledge and Data Engineering.

[36]  Gerhard Weikum,et al.  KLEE: A Framework for Distributed Top-k Query Algorithms , 2005, VLDB.