Efficient processing of probabilistic set-containment queries on uncertain set-valued data

Set-valued data is a natural and concise representation for modeling complex objects. As an important operation of object-oriented or object-relational database, set containment query processing over set-valued data has been extensively studied in previous works. Recently, there is a growing realization that uncertain information is a first-class citizen in modern database management. As such, there is a strong demand for study of set containment queries over uncertain set-valued data. This paper investigates how set-containment queries over uncertain set-valued data can be efficiently processed. Based on the popular possible world semantics, we first present a practical model in which the uncertainty in set-valued data is represented by existential probabilities, and propose the probabilistic set containment semantics and its generalization - the expected Jaccard containment. Second, to avoid expensive computations in enumerating all possible worlds, we develop efficient schemes for computing these two probabilistic semantics. Third, we introduce two important queries, namely probability threshold containment query (PTCQ) and probability threshold containment join (PTCJ), and propose novel techniques to process them efficiently. Finally, we conduct extensive experiments to study the efficiency of the proposed methods. The experimental results indicate that the proposed methods are efficient in processing the uncertain set containment queries.

[1]  Alistair Moffat,et al.  An Efficient Indexing Technique for Full Text Databases , 1992, Very Large Data Bases Conference.

[2]  Dan Suciu,et al.  Management of probabilistic data: foundations and challenges , 2007, PODS '07.

[3]  Xuemin Lin,et al.  Top-k Set Similarity Joins , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[4]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near duplicate detection , 2008, WWW.

[5]  Bin Jiang,et al.  Ranking uncertain sky: The probabilistic top-k skyline operator , 2011, Inf. Syst..

[6]  Sven Helmer,et al.  A performance study of four index structures for set-valued attributes of low cardinality , 2003, The VLDB Journal.

[7]  Chi-Yin Chow,et al.  Probabilistic Verifiers: Evaluating Constrained Nearest-Neighbor Queries over Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[8]  Nikos Mamoulis,et al.  Similarity search in sets and categorical data using the signature tree , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[9]  Xiang Lian,et al.  Probabilistic inverse ranking queries in uncertain databases , 2011, The VLDB Journal.

[10]  Jeffrey Scott Vitter,et al.  Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data , 2004, VLDB.

[11]  Alon Y. Halevy,et al.  Data integration with uncertainty , 2007, The VLDB Journal.

[12]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[13]  Xiang Lian,et al.  Set similarity join on probabilistic data , 2010, Proc. VLDB Endow..

[14]  Xiang Lian,et al.  Finding the least influenced set in uncertain databases , 2011, Inf. Syst..

[15]  Feifei Li,et al.  Probabilistic string similarity joins , 2010, SIGMOD Conference.

[16]  Jian Pei,et al.  Ranking queries on uncertain data , 2010, The VLDB Journal.

[17]  Sunita Sarawagi,et al.  Efficient set joins on similarity predicates , 2004, SIGMOD '04.

[18]  Dimitrios Gunopulos,et al.  Efficient and tumble similar set retrieval , 2001, SIGMOD '01.

[19]  Parag Agrawal,et al.  On indexing error-tolerant set containment , 2010, SIGMOD Conference.

[20]  Tomasz Imielinski,et al.  Incomplete Information in Relational Databases , 1984, JACM.

[21]  Bin Jiang,et al.  Probabilistic Skylines on Uncertain Data , 2007, VLDB.

[22]  Michael Stonebraker,et al.  Object-Relational DBMSs: The Next Great Wave , 1995 .

[23]  Yufei Tao,et al.  Range search on multidimensional uncertain data , 2007, TODS.

[24]  Jeffrey F. Naughton,et al.  Set Containment Joins: The Good, The Bad and The Ugly , 2000, VLDB.

[25]  Susanne E. Hambrusch,et al.  Indexing Uncertain Categorical Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[26]  Abraham Silberschatz,et al.  Extended algebra and calculus for nested relational databases , 1988, TODS.

[27]  Hans-Peter Kriegel,et al.  Scalable Probabilistic Similarity Ranking in Uncertain Databases , 2010, IEEE Transactions on Knowledge and Data Engineering.

[28]  Timos K. Sellis,et al.  Efficient answering of set containment queries for skewed item distributions , 2011, EDBT/ICDT '11.

[29]  H. Zimmermann,et al.  Fuzzy Set Theory and Its Applications , 1993 .

[30]  David Jordan,et al.  The Object Database Standard: ODMG 2.0 , 1997 .

[31]  Val Tannen,et al.  Models for Incomplete and Probabilistic Information , 2006, IEEE Data Eng. Bull..

[32]  Shan Wang,et al.  Combining intensional with extensional query evaluation in tuple independent probabilistic databases , 2011, Inf. Sci..

[33]  Shaojie Tang,et al.  Canopy closure estimates with GreenOrbs: sustainable sensing in the forest , 2009, SenSys '09.

[34]  Philip S. Yu,et al.  A new method for similarity indexing of market basket data , 1999, SIGMOD '99.

[35]  Surajit Chaudhuri,et al.  A Primitive Operator for Similarity Joins in Data Cleaning , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[36]  Serge Abiteboul,et al.  On the representation and querying of sets of possible worlds , 1987, SIGMOD '87.

[37]  Xiang Lian,et al.  Efficient processing of probabilistic reverse nearest neighbor queries over uncertain data , 2009, The VLDB Journal.

[38]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data , 2011, IEEE Transactions on Knowledge and Data Engineering.

[39]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[40]  Donald Ervin Knuth,et al.  The Art of Computer Programming , 1968 .

[41]  Rahul Gupta,et al.  Creating probabilistic databases from information extraction models , 2006, VLDB.

[42]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Ihab F. Ilyas,et al.  Efficient search for the top-k probable nearest neighbors in uncertain databases , 2008, Proc. VLDB Endow..

[44]  Hans-Peter Kriegel,et al.  Probabilistic Similarity Join on Uncertain Data , 2006, DASFAA.

[45]  Nikos Mamoulis,et al.  Efficient processing of joins on set-valued attributes , 2003, SIGMOD '03.

[46]  Xiang Lian,et al.  Reverse skyline search in uncertain databases , 2008, TODS.

[47]  Kotagiri Ramamohanarao,et al.  Inverted files versus signature files for text indexing , 1998, TODS.

[48]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[49]  Frederick Hayes-Roth,et al.  Rule-based systems , 1985, CACM.

[50]  Glenn Shafer,et al.  A Mathematical Theory of Evidence , 2020, A Mathematical Theory of Evidence.

[51]  Ihab F. Ilyas,et al.  Supporting ranking queries on uncertain and incomplete data , 2010, The VLDB Journal.

[52]  Xiang Lian,et al.  Shooting top-k stars in uncertain databases , 2011, The VLDB Journal.

[53]  Jian Pei,et al.  Probabilistic Reverse Nearest Neighbor Queries on Uncertain Data , 2010, IEEE Transactions on Knowledge and Data Engineering.

[54]  Ambuj K. Singh,et al.  Top-k Spatial Joins of Probabilistic Objects , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[55]  Hector Garcia-Molina,et al.  Adaptive algorithms for set containment joins , 2003, TODS.

[56]  Xiang Lian,et al.  Probabilistic Group Nearest Neighbor Queries in Uncertain Databases , 2008, IEEE Transactions on Knowledge and Data Engineering.

[57]  Donald E. Knuth,et al.  The Art of Computer Programming: Volume 3: Sorting and Searching , 1998 .