Paper Information - Efficient Distribution Mining and Classification

Efficient Distribution Mining and Classification

We define and solve the problem of “distribution classification” and, more generally, “distribution mining”. Given n distributions (i.e., clouds) of multi-dimensional points, we want to classify them into k classes and to find patterns, rules, and outlier clouds. For example, consider the 2-d case of sales of items, where, for each item sold, we record the unit price and quantity; each customer is then represented as a distribution/cloud of 2-d points (one for each item the customer bought). We want to group similar customers together, e.g., for market segmentation or anomaly/fraud detection. We propose D-Mine to achieve this goal. Our main contribution is Theorem 3.1, which shows how to use wavelets to speed up the cloud-similarity computations. Extensive experiments on both synthetic and real multidimensional data sets show that our method achieves up to 400 times faster wall-clock time than the naive implementation, with comparable (and occasionally better) classification quality.
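To make the problem setting concrete, the following is a minimal sketch of a naive baseline, not the D-Mine method itself. It assumes each cloud is bucketized into a normalized 2-d histogram, that clouds are compared with a symmetric Kullback-Leibler divergence, and that a new cloud takes the label of its nearest labeled cloud. The abstract does not state the similarity measure or the classifier, so these choices, along with the function names `histogram`, `symmetric_kl`, and `classify` and the bucket range, are illustrative assumptions. The wavelet-based speed-up of Theorem 3.1 is not attempted here.

```python
import math
from collections import Counter


def histogram(cloud, bins=4, lo=0.0, hi=100.0):
    """Bucketize a cloud of 2-d points into a normalized bins x bins histogram.

    Assumption: both coordinates fall roughly in [lo, hi]; values outside
    the range are clamped into the first or last bucket.
    """
    width = (hi - lo) / bins
    counts = Counter()
    for x, y in cloud:
        i = min(max(int((x - lo) / width), 0), bins - 1)
        j = min(max(int((y - lo) / width), 0), bins - 1)
        counts[(i, j)] += 1
    eps = 1e-6  # smoothing so that empty buckets do not lead to log(0)
    norm = len(cloud) + eps * bins * bins
    return [(counts[(i, j)] + eps) / norm
            for i in range(bins) for j in range(bins)]


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p||q) + KL(q||p) for two histograms."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def classify(cloud, labeled_clouds):
    """Return the label of the labeled cloud closest to `cloud` (1-NN)."""
    h = histogram(cloud)
    nearest = min(labeled_clouds,
                  key=lambda item: symmetric_kl(h, histogram(item[0])))
    return nearest[1]


if __name__ == "__main__":
    # Hypothetical customers: each point is (unit price, quantity).
    bulk_buyer = [(5.0, 80.0), (8.0, 90.0), (6.0, 85.0)]
    luxury_buyer = [(90.0, 1.0), (85.0, 2.0), (95.0, 1.0)]
    labeled = [(bulk_buyer, "bulk"), (luxury_buyer, "luxury")]
    print(classify([(7.0, 88.0), (4.0, 79.0)], labeled))
```

In this baseline every comparison touches all buckets of both histograms, which is the kind of cost the abstract says the wavelet-based approach is meant to reduce.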

Christos Faloutsos | Lei Li | Yasushi Sakurai | Rosalynn Chong
