Indexing Metric Uncertain Data for Range Queries

Range queries in metric spaces have applications in many areas, such as multimedia retrieval, computational biology, and location-based services. In these areas, metric uncertain data arises in different forms as a result of equipment limitations, high-throughput sequencing technologies, privacy preservation, or other causes. In this paper, we represent metric uncertain data using an object-level model and a bi-level model. For these two models, respectively, we propose two novel indexes, the uncertain pivot B+-tree (UPB-tree) and the uncertain pivot B+-forest (UPB-forest), to support probabilistic range queries over a wide range of uncertain data types and similarity metrics. Both index structures use a small set of effective pivots chosen according to a newly defined criterion, and employ B+-tree(s) as the underlying index. By design, they are easy to integrate into any existing DBMS. In addition, we present efficient metric probabilistic range query algorithms that use validation and pruning techniques based on the probability lower and upper bounds we derive. Extensive experiments with both real and synthetic data sets demonstrate that, compared with existing state-of-the-art indexes for metric uncertain data, the UPB-tree and UPB-forest incur much lower construction costs, consume less storage space, and support more efficient metric probabilistic range queries.
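As a rough illustration of the validation and pruning idea, the following sketch applies generic triangle-inequality bounds to uncertain objects stored as weighted instance sets. It is not the UPB-tree or the bounds derived in the paper: the function names, the in-memory layout, the threshold parameter `tau`, and the Euclidean metric are all assumptions made for this example, and no B+-tree is involved.

```python
from math import dist  # Euclidean metric; any metric function could be passed instead


def pivot_bounds(q_to_pivots, x_to_pivots):
    """Triangle-inequality bounds on d(q, x) from distances to shared pivots."""
    lower = max(abs(a - b) for a, b in zip(q_to_pivots, x_to_pivots))
    upper = min(a + b for a, b in zip(q_to_pivots, x_to_pivots))
    return lower, upper


def probabilistic_range_query(objects, pivots, q, r, tau, metric=dist):
    """Return names of objects whose probability of lying within r of q is >= tau.

    objects: dict mapping a name to a list of (instance, probability) pairs
             (hypothetical layout; probabilities of one object sum to 1).
    """
    q_to_pivots = [metric(q, p) for p in pivots]
    results = []
    for name, instances in objects.items():
        # A real index would store these pivot distances at build time.
        table = [(x, w, [metric(x, p) for p in pivots]) for x, w in instances]
        p_low = p_up = 0.0
        for _, w, x_to_pivots in table:
            lo, hi = pivot_bounds(q_to_pivots, x_to_pivots)
            if hi <= r:   # instance is certainly inside the query range
                p_low += w
            if lo <= r:   # instance might be inside the query range
                p_up += w
        if p_up < tau:    # pruning: the object cannot reach the threshold
            continue
        if p_low >= tau:  # validation: the object certainly qualifies
            results.append(name)
            continue
        # Undecided: fall back to exact distance computations.
        exact = sum(w for x, w, _ in table if metric(x, q) <= r)
        if exact >= tau:
            results.append(name)
    return results


if __name__ == "__main__":
    pivots = [(0.0, 0.0), (10.0, 0.0)]
    objects = {
        "a": [((1.0, 1.0), 0.5), ((2.0, 0.0), 0.5)],
        "b": [((9.0, 0.0), 0.5), ((10.0, 1.0), 0.5)],
        "c": [((1.0, 0.5), 0.6), ((8.0, 0.0), 0.4)],
    }
    print(probabilistic_range_query(objects, pivots, (1.0, 0.0), 2.0, 0.5))
```

In this sketch, an object is discarded or accepted from pivot distances alone whenever the bounds are decisive, so exact distance computations are needed only for the undecided objects.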
