Indexing the Earth Mover's Distance Using Normal Distributions

Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulties comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-- nearest neighbor (K--NN) queries on uncertain databases. We propose a new lower bound to the EMD that approximates the EMD on a projection vector. Each distribution is projected onto a vector and approximated by a normal distribution, as well as an accompanying error term. We then represent each normal as a point in a Hough transformed space. We then use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K--NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD based queries tractable as uncertain data sets become larger and more complex.

[1]  Hanan Samet,et al.  Foundations of multidimensional and metric data structures , 2006, Morgan Kaufmann series in data management systems.

[2]  Trevor Darrell,et al.  Fast contour matching using approximate earth mover's distance , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[3]  Ambuj K. Singh,et al.  Indexing Spatially Sensitive Distance Measures Using Multi-resolution Lower Bounds , 2006, EDBT.

[4]  Hung-Khoon Tan,et al.  Localized matching using Earth Mover's Distance towards discovery of common patterns from small image samples , 2009, Image Vis. Comput..

[5]  S. H. Srinivasan Local earth mover's distance and face warping [multimedia object distance measure] , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[6]  Hai Zhou,et al.  Statistical Timing Analysis With Coupling , 2006, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[7]  Ira Assent,et al.  Efficient EMD-based similarity search in multimedia databases via flexible dimensionality reduction , 2008, SIGMOD Conference.

[8]  Christos Faloutsos,et al.  Fast Time Sequence Indexing for Arbitrary Lp Norms , 2000, VLDB.

[9]  Ambuj K. Singh,et al.  APLA: Indexing Arbitrary Probability Distributions , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[10]  Hongyuan Zha,et al.  A new Mallows distance based metric for comparing clusterings , 2005, ICML '05.

[11]  Anthony K. H. Tung,et al.  Efficient and effective similarity search over probabilistic data based on Earth Mover’s Distance , 2010, The VLDB Journal.

[12]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[13]  Moni Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS.

[14]  Ira Assent,et al.  Approximation Techniques for Indexing the Earth Mover’s Distance in Multimedia Databases , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[15]  L. Guibas,et al.  The Earth Mover''s Distance: Lower Bounds and Invariance under Translation , 1997 .

[16]  G. Lieberman,et al.  Introduction to Mathematical Programming , 1990 .

[17]  Mtw,et al.  Mass Transportation Problems: Vol. I: Theory@@@Mass Transportation Problems: Vol. II: Applications , 1999 .

[18]  Kenneth Ward Church,et al.  Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections , 2006, J. Mach. Learn. Res..

[19]  CharikarMoses,et al.  On the impossibility of dimension reduction in l1 , 2005 .

[20]  Alan V. Oppenheim,et al.  Discrete-time Signal Processing. Vol.2 , 2001 .

[21]  Tobias Meisen,et al.  Efficient similarity search using the Earth Mover's Distance for large multimedia databases , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[22]  H. Levy Stochastic dominance and expected utility: survey and analysis , 1992 .