D-Search: an efficient and exact search algorithm for large distribution sets

Distribution data arise naturally in many domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find the distributions ("clouds") most similar to Q in order to discover patterns, rules and outlier clouds. For example, consider the numerical case of item sales, where we record the unit price and quantity of each item sold; each customer is then represented as a distribution of 2-d points (one for each item the customer bought). We want to find similar customers, e.g., for market segmentation or anomaly/fraud detection. We address this problem with D-Search, a set of fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) an approximate KL divergence, which speeds up cloud-similarity computations, and (2) a multistep sequential scan, which prunes a significant number of search candidates and thereby directly reduces the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multidimensional datasets show that our solution is up to 2,300 times faster in wall clock time than the naive implementation, without sacrificing accuracy.
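
To make the search problem concrete, the following is a minimal sketch of a naive baseline of the kind such a method would be compared against: an exhaustive scan that ranks every stored distribution by its divergence from the query. It is not the D-Search algorithm itself; it implements neither the approximate KL divergence nor the multistep pruning. The histogram representation, the symmetric form of the KL divergence, the `eps` smoothing constant, and all function names are assumptions made for illustration.

```python
import math


def normalize(counts):
    """Turn raw bucket counts into a discrete probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two equal-length discrete distributions.

    eps is an illustrative smoothing constant that avoids log(0) and
    division by zero for empty buckets.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Symmetric divergence KL(p || q) + KL(q || p) (assumed here)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def naive_search(query, dataset, k=1):
    """Exhaustive scan: score every distribution, return the k closest.

    Returns a list of (distance, index) pairs sorted by distance.
    """
    scored = [(symmetric_kl(query, d), i) for i, d in enumerate(dataset)]
    scored.sort()
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical customers, each summarized as a 3-bucket histogram.
    dataset = [normalize(c) for c in ([1, 1, 8], [5, 3, 2], [4, 4, 2])]
    query = normalize([5, 3, 2])
    for distance, index in naive_search(query, dataset, k=2):
        print(index, round(distance, 4))
```

The scan computes one full divergence per stored distribution, so its cost grows linearly with n; this per-candidate cost is what a cheaper approximate divergence and candidate pruning aim to reduce.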
