Aggregate queries on constrained probabilistic similarity join pairs

Abstract

Join aggregate queries, in which a join is followed by an aggregate operation, are very common in data processing. When data are incomplete or ambiguous, probabilistic similarity join (PSJ) is widely used: it assigns each joined pair a probability that reflects the likelihood that the pair belongs to the join result. Based on the mapping constraints, we formally define possible-world semantics for three PSJ types (many-to-many, one-to-many, and one-to-one) and propose algorithms to evaluate aggregate queries on these constrained PSJ pairs. First, we model many-to-many PSJ pairs with a tuple-level uncertainty model and propose two aggregate algorithms, one based on dynamic programming and one on a divide-and-conquer strategy. Next, we model one-to-many PSJ pairs with an attribute-level uncertainty model and extend the many-to-many aggregate algorithms to this model. Finally, we model one-to-one PSJ pairs with a probabilistic graphical model and propose a new aggregate algorithm that combines the generating function method, dynamic programming, and a divide-and-conquer strategy. Extensive experiments on real datasets demonstrate order-of-magnitude improvements of our algorithms over the baselines.
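
To make the many-to-many case more concrete, the following is a minimal sketch of how a dynamic program can compute the exact distribution of a SUM aggregate over tuple-level uncertain pairs. It is an illustration under stated assumptions, not the paper's algorithm: it assumes each joined pair exists independently with its own probability and carries an integer value, and the function name `sum_distribution` and the `(value, prob)` input format are hypothetical.

```python
from collections import defaultdict


def sum_distribution(pairs):
    """Return {sum_value: probability} over all possible worlds.

    pairs: list of (value, prob) tuples, one per joined pair.
    Assumption (not from the paper): pairs exist independently,
    each with probability `prob`, and `value` is an integer.
    """
    # Before any pair is considered, the sum is 0 with certainty.
    dist = {0: 1.0}
    for value, prob in pairs:
        nxt = defaultdict(float)
        for partial_sum, p in dist.items():
            # World where this pair is absent: sum is unchanged.
            nxt[partial_sum] += p * (1.0 - prob)
            # World where this pair is present: its value is added.
            nxt[partial_sum + value] += p * prob
        dist = dict(nxt)
    return dist


# Two hypothetical joined pairs, each present with probability 0.5.
pairs = [(10, 0.5), (20, 0.5)]
dist = sum_distribution(pairs)
expected_sum = sum(s * p for s, p in dist.items())
```

Processing one pair at a time keeps the state to the set of distinct partial sums rather than the 2^n possible worlds, which is the usual motivation for a dynamic-programming formulation. The one-to-many and one-to-one cases introduce correlations between pairs, so this independent-pair sketch does not apply to them directly.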
