Density-Based Multiscale Data Condensation

A problem of growing interest in pattern recognition applied to data mining is that of selecting a small representative subset from a very large data set. This article proposes a nonparametric data reduction scheme that attempts to represent the density underlying the data. The algorithm selects representative points in a multiscale fashion, which distinguishes it from existing density-based approaches. The accuracy of the condensed set's representation is measured by the error between density estimates computed from the original and reduced sets. Experimental studies on several real-life data sets show that the multiscale approach is superior to several related condensation methods in terms of both condensation ratio and estimation error. The condensed set is also shown experimentally to be effective for important data mining tasks such as classification, clustering, and rule generation on large data sets. Moreover, the algorithm is found empirically to be efficient in terms of sample complexity.
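The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one plausible reading of "density-based multiscale" condensation, not the article's stated algorithm. It assumes that local density is gauged by the distance to the k-th nearest neighbor, that points in the densest regions are picked first, and that each pick discards all points within twice its own neighbor radius. The function name `condense`, the parameter `k`, and the factor 2.0 are illustrative assumptions.

```python
import numpy as np


def condense(points, k=5):
    """Hypothetical sketch of greedy, density-driven condensation.

    Returns the indices of the selected points and their k-NN radii.
    Assumes k < len(points); uses O(n^2) memory, so small inputs only.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Full pairwise Euclidean distance matrix.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    # Distance to the k-th nearest neighbor; column 0 after sorting
    # is the point itself (distance 0), so column k is the k-th neighbor.
    radius = np.sort(dist, axis=1)[:, k]

    remaining = np.ones(n, dtype=bool)
    selected = []

    # Visit points from the smallest radius (densest region) upward.
    for i in np.argsort(radius, kind="stable"):
        if not remaining[i]:
            continue
        selected.append(i)
        # Assumed rule: discard everything within twice the radius of the
        # chosen point. The disc size varies with local density, which is
        # what makes the selection multiscale in this sketch.
        remaining &= dist[i] > 2.0 * radius[i]

    selected = np.array(selected)
    return selected, radius[selected]
```

Under these assumptions, dense regions are covered by many small discs and sparse regions by a few large ones, so the retained points roughly follow the density of the original set. Increasing `k` enlarges every disc and yields a smaller condensed set.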
