Efficient and effective similarity search over probabilistic data based on Earth Mover’s Distance

Probabilistic data is coming as a new deluge along with the technical advances on geographical tracking, multimedia processing, sensor network and RFID. While similarity search is an important functionality supporting the manipulation of probabilistic data, it raises new challenges to traditional relational database. The problem stems from the limited effectiveness of the distance metric supported by the existing database system. On the other hand, some complicated distance operators have proven their values for better distinguishing ability in the probabilistic domain. In this paper, we discuss the similarity search problem with the Earth Mover's Distance, which is the most successful distance metric on probabilistic histograms and an expensive operator with cubic complexity. We present a new database approach to answer range queries and k-nearest neighbor queries on probabilistic data, on the basis of Earth Mover's Distance. Our solution utilizes the primal-dual theory in linear programming and deploys B+ tree index structures for effective candidate pruning. Extensive experiments show that our proposal dramatically improves the scalability of probabilistic databases.

[1]  Yufei Tao,et al.  Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions , 2005, VLDB.

[2]  Klaus H. Hinrichs,et al.  Managing uncertainty in moving objects databases , 2004, TODS.

[3]  Leonidas J. Guibas,et al.  A metric for distributions with applications to image databases , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[4]  Kevin Chen-Chuan Chang,et al.  Probabilistic top-k and ranking-aggregate queries , 2008, TODS.

[5]  Yufei Tao,et al.  Range search on multidimensional uncertain data , 2007, TODS.

[6]  Ira Assent,et al.  Approximation Techniques for Indexing the Earth Mover’s Distance in Multimedia Databases , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[7]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[8]  Sunil Prabhakar,et al.  U-DBMS: A Database System for Managing Constantly-Evolving Data , 2005, VLDB.

[9]  Jian Li,et al.  A unified approach to ranking in probabilistic databases , 2009, The VLDB Journal.

[10]  Jian Pei,et al.  Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[11]  Dan Suciu,et al.  Probabilistic Event Extraction from RFID Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[12]  Moni Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS '01.

[13]  Hans-Peter Kriegel,et al.  Optimal multi-step k-nearest neighbor search , 1998, SIGMOD '98.

[14]  Hermann Ney,et al.  Discriminative training for object recognition using image patches , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15]  Haibin Ling,et al.  An Efficient Earth Mover's Distance Algorithm for Robust Histogram Comparison , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Trevor Darrell,et al.  Fast contour matching using approximate earth mover's distance , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[17]  Wei Hong,et al.  Model-based approximate querying in sensor networks , 2005, The VLDB Journal.

[18]  Dan Suciu,et al.  Management of probabilistic data: foundations and challenges , 2007, PODS '07.

[19]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data and Expected Ranks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[20]  Anthony K. H. Tung,et al.  Similarity Search on Bregman Divergence: Towards Non-Metric Indexing , 2009, Proc. VLDB Endow..

[21]  Samuel Madden,et al.  Using Probabilistic Models for Data Management in Acquisitional Environments , 2005, CIDR.

[22]  Laks V. S. Lakshmanan,et al.  ProbView: a flexible probabilistic database system , 1997, TODS.

[23]  Alexandr Andoni,et al.  Earth mover distance over high-dimensional spaces , 2008, SODA '08.

[24]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[25]  David W. Jacobs,et al.  Approximate earth mover’s distance in linear time , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Michael Lindenbaum,et al.  Nonnegative Matrix Factorization with Earth Mover's Distance metric , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Daisy Zhe Wang,et al.  Querying probabilistic information extraction , 2010, Proc. VLDB Endow..

[28]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[29]  Beng Chin Ooi,et al.  On multi-column foreign key discovery , 2010, Proc. VLDB Endow..

[30]  Jennifer Widom,et al.  Continuous queries over data streams , 2001, SGMD.

[31]  Parag Agrawal,et al.  Trio: a system for data, uncertainty, and lineage , 2006, VLDB.

[32]  Kenneth Steiglitz,et al.  Combinatorial Optimization: Algorithms and Complexity , 1981 .

[33]  Ira Assent,et al.  Efficient EMD-based similarity search in multimedia databases via flexible dimensionality reduction , 2008, SIGMOD Conference.

[34]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[35]  Joachim M. Buhmann,et al.  Empirical Evaluation of Dissimilarity Measures for Color and Texture , 2001, Comput. Vis. Image Underst..

[36]  Charu C. Aggarwal,et al.  Managing and Mining Uncertain Data , 2009, Advances in Database Systems.

[37]  Wei Hong,et al.  Approximate Data Collection in Sensor Networks using Probabilistic Models , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[38]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.