APLA: Indexing Arbitrary Probability Distributions

The ability to store and query uncertain information is of great benefit to databases that infer values from a set of observations, including databases of moving objects, sensor readings, historical business transactions, and biomedical images. These observations are often inexact to begin with, and even when they are exact, a set of observations of an attribute of an object is better represented by a probability distribution than by a single number, such as a mean. In this paper, we present adaptive, piecewise-linear approximations (APLAs), which represent arbitrary probability distributions compactly with guaranteed quality. We also present the APLA-tree, an index structure for APLAs. Because APLA is more precise than existing approximation techniques, the APLA-tree can answer probabilistic range queries twice as fast as with those techniques. APLA generalizes to multiple dimensions, and the APLA-tree can index multivariate distributions using either one-dimensional or multidimensional APLAs. Finally, we propose a new definition of k-NN queries on uncertain data. The new definition allows APLA and the APLA-tree to answer k-NN queries quickly, even on arbitrary probability distributions. No efficient k-NN search was previously possible on such distributions.
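To make the idea of a piecewise-linear approximation with an error bound concrete, here is a minimal Python sketch. It is not the APLA construction from the paper, whose details the abstract does not give. It assumes the distribution is available as a sampled cumulative distribution function (CDF) and uses a simple greedy refinement as a stand-in: breakpoints are added at the worst-fit sample until the maximum error at the samples is at most `eps`. All names (`interpolate`, `approximate_cdf`, `range_probability`, `eps`) are hypothetical and chosen for illustration only.

```python
import bisect
import math


def interpolate(xs, ys, x):
    """Evaluate the piecewise-linear function through (xs, ys) at x.

    Values outside [xs[0], xs[-1]] are clamped to the end values.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def approximate_cdf(xs, cdf, eps):
    """Greedy stand-in for an adaptive approximation (an assumption,
    not the paper's algorithm): keep the two end samples, then keep
    adding the sample with the largest error as a new breakpoint
    until every sample is within eps of the approximation.
    """
    keep = [0, len(xs) - 1]
    while True:
        bx = [xs[i] for i in keep]
        by = [cdf[i] for i in keep]
        errs = [abs(c - interpolate(bx, by, x)) for x, c in zip(xs, cdf)]
        worst = max(range(len(xs)), key=errs.__getitem__)
        if errs[worst] <= eps:
            return bx, by
        bisect.insort(keep, worst)


def range_probability(bx, by, lo, hi):
    """Approximate P(lo <= X <= hi) as F(hi) - F(lo) on the approximation."""
    return interpolate(bx, by, hi) - interpolate(bx, by, lo)


# Illustrative input: a standard normal CDF sampled at 201 points on [-4, 4].
xs = [-4 + 8 * i / 200 for i in range(201)]
cdf = [0.5 * (1 + math.erf(x / math.sqrt(2))) for x in xs]

bx, by = approximate_cdf(xs, cdf, eps=0.01)
p = range_probability(bx, by, -1.0, 1.0)
print(len(bx), "breakpoints; P(-1 <= X <= 1) is approximately", p)
```

Because each CDF value is within `eps` of the true sampled value, a range probability computed this way is off by at most `2 * eps` at the sample points. That bound is what lets an index prune or accept candidates without touching the full distribution.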
