Range counting coresets for uncertain data

We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct possible locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points that lie in the range with probability at least a threshold. RQ queries return the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. For each type of query we provide coreset constructions with approximation-size tradeoffs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches for axis-aligned range queries.
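
To make the query definitions concrete, here is a minimal Python sketch of RE and RC queries, together with a uniform random sample used as a candidate coreset. It rests on assumptions that are not taken from the abstract: each uncertain point is a list of k equally likely planar locations, ranges are closed axis-aligned rectangles, and all names (`prob_in_range`, `re_query`, `rc_query`, `sample_coreset`) are hypothetical. The sample size `m` is a free parameter here and is not tied to any bound from the paper.

```python
import random
from typing import List, Sequence, Tuple

Location = Tuple[float, float]
# Assumption: an uncertain point is k equally likely locations.
UncertainPoint = Sequence[Location]
# Assumption: a range is a closed rectangle (xmin, ymin, xmax, ymax).
Rect = Tuple[float, float, float, float]


def prob_in_range(point: UncertainPoint, rect: Rect) -> float:
    """Probability that one uncertain point falls inside the rectangle."""
    xmin, ymin, xmax, ymax = rect
    inside = sum(1 for (x, y) in point
                 if xmin <= x <= xmax and ymin <= y <= ymax)
    return inside / len(point)


def re_query(points: Sequence[UncertainPoint], rect: Rect) -> float:
    """RE query: expected number of points in the range.

    By linearity of expectation this is the sum of the per-point
    probabilities of lying in the range.
    """
    return sum(prob_in_range(p, rect) for p in points)


def rc_query(points: Sequence[UncertainPoint], rect: Rect, tau: float) -> int:
    """RC query: number of points in the range with probability >= tau."""
    return sum(1 for p in points if prob_in_range(p, rect) >= tau)


def sample_coreset(points: Sequence[UncertainPoint], m: int,
                   seed: int = 0) -> List[UncertainPoint]:
    """Uniform sample of m uncertain points, each kept with all its locations."""
    rng = random.Random(seed)
    return rng.sample(list(points), m)


if __name__ == "__main__":
    pts = [
        [(0.0, 0.0), (1.0, 1.0)],    # always inside the query below
        [(0.0, 0.0), (5.0, 5.0)],    # inside with probability 1/2
        [(9.0, 9.0), (10.0, 10.0)],  # never inside
    ]
    query = (0.0, 0.0, 2.0, 2.0)
    print(re_query(pts, query))       # expected count on the full set
    print(rc_query(pts, query, 0.5))  # points with probability >= 0.5
    sub = sample_coreset(pts, 2)
    # Rescale the sample's answer by n/m to estimate the full-set answer.
    print(re_query(sub, query) * len(pts) / len(sub))
```

The sample keeps every location of each chosen uncertain point, matching the abstract's requirement that the subset retain the points' locational uncertainty. An RQ query would additionally need the distribution of the count of points in the range, which this sketch does not compute.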
