Enhancing Density-Based Data Reduction Using Entropy

Data reduction algorithms determine a small data subset from a given large data set. This article first presents new data reduction criteria based on the concept of entropy. These criteria allow data reduction performance to be evaluated comprehensively. Building on them, new data reduction procedures are developed, and the proposed data reduction scheme is shown under the new criteria to be efficient and effective. In addition, an outlier-filtering strategy that adds negligible computational cost is developed; in some instances, this strategy can substantially improve the performance of supervised data analysis. The proposed procedures are compared with related techniques in two types of application: density estimation and classification. Extensive comparative results are included to support the contributions of the proposed algorithms.
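The abstract does not spell out the entropy-based criteria, so the following is only an illustrative sketch of the general idea, not the article's method. It assumes one-dimensional data, a Gaussian Parzen-window density estimate, and relative entropy (Kullback-Leibler divergence) between the density of the full set and that of the reduced set as the quality score. The function names `parzen_density` and `relative_entropy`, the bandwidth of 0.3, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np

def parzen_density(query, samples, bandwidth):
    """Gaussian Parzen-window density estimate at each query point (1-D data)."""
    diffs = (query[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernel.mean(axis=1)

def relative_entropy(full, reduced, bandwidth=0.3, eps=1e-12):
    """Estimate KL(p_full || p_reduced) by averaging over the full data set.

    This is an assumed, illustrative criterion; eps guards against log(0).
    """
    p = parzen_density(full, full, bandwidth)
    q = parzen_density(full, reduced, bandwidth)
    return float(np.mean(np.log((p + eps) / (q + eps))))

# Synthetic two-cluster data set (hypothetical, for illustration only).
rng = np.random.default_rng(0)
full = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])

# Two candidate reduced sets of the same size.
random_subset = rng.choice(full, size=50, replace=False)
one_sided_subset = np.sort(full)[:50]  # keeps only the leftmost points

print("random subset:   ", relative_entropy(full, random_subset))
print("one-sided subset:", relative_entropy(full, one_sided_subset))
```

Under these assumptions, a lower score means the reduced set reproduces the density of the full set more closely, so the subset that ignores one cluster should score noticeably worse than the random subset.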
