Data reduction via adaptive sampling

Data reduction is an important issue in data mining. This article describes a new method for selecting a subset of records from a large dataset. A simplified chi-square criterion is proposed for measuring the goodness of fit between the distribution of the reduced dataset and that of the full dataset. Under this criterion, the data reduction problem can be formulated as a binary quadratic program, which is solved with a tabu search technique. The procedure is adaptive in that it combines random sampling with a deterministic search guided by the results of previous search steps. The method applies primarily to discrete data but can be extended to continuous data. An experimental study compares the proposed method with simple random sampling on a number of simulated and real-world datasets. The results indicate that the distributions of the samples produced by the proposed method are significantly closer to the true distribution than those of random samples.
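The idea of combining a random start with a guided search can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the exact form of the simplified chi-square criterion is not given here, so the sketch assumes a per-attribute Pearson statistic summed over all attributes, and it replaces the tabu search for binary quadratic programs with a plain swap-based local search that keeps a short tabu list. The names `chi_square_distance` and `adaptive_sample`, and the parameters `iterations`, `candidates`, and `tabu_tenure`, are hypothetical.

```python
import random
from collections import Counter


def chi_square_distance(sample, full):
    """Assumed criterion: for each attribute, sum (observed - expected)^2 / expected
    over its categories, where expected counts come from the full dataset."""
    n, total_rows = len(sample), len(full)
    distance = 0.0
    for j in range(len(full[0])):
        full_counts = Counter(row[j] for row in full)
        sample_counts = Counter(row[j] for row in sample)
        for value, count in full_counts.items():
            expected = n * count / total_rows  # always > 0 because count >= 1
            observed = sample_counts.get(value, 0)
            distance += (observed - expected) ** 2 / expected
    return distance


def adaptive_sample(full, n, iterations=100, candidates=20, tabu_tenure=10, seed=0):
    """Start from a simple random sample, then repeatedly swap one selected row
    for one unselected row, choosing the best of a few random candidate swaps.
    Recently moved rows are tabu (frozen) for `tabu_tenure` iterations."""
    rng = random.Random(seed)
    indices = list(range(len(full)))
    selected = set(rng.sample(indices, n))
    initial_score = chi_square_distance([full[i] for i in selected], full)
    best, best_score = set(selected), initial_score
    tabu_until = {}  # row index -> first iteration at which it may move again

    for it in range(iterations):
        outs = [i for i in selected if tabu_until.get(i, 0) <= it]
        ins = [i for i in indices if i not in selected and tabu_until.get(i, 0) <= it]
        if not outs or not ins:
            break
        trials = []
        for _ in range(candidates):
            out_i, in_i = rng.choice(outs), rng.choice(ins)
            trial = (selected - {out_i}) | {in_i}
            score = chi_square_distance([full[i] for i in trial], full)
            trials.append((score, out_i, in_i))
        # Accept the best candidate even if it is worse than the current sample;
        # this lets the search leave local minima, as tabu search does.
        score, out_i, in_i = min(trials)
        selected = (selected - {out_i}) | {in_i}
        tabu_until[out_i] = tabu_until[in_i] = it + 1 + tabu_tenure
        if score < best_score:
            best, best_score = set(selected), score

    return sorted(best), initial_score, best_score


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Synthetic discrete data: two categorical attributes with skewed frequencies.
    data = [(data_rng.choice("aaab"), data_rng.choice("xyyzzz")) for _ in range(300)]
    rows, start, end = adaptive_sample(data, n=30)
    print(f"random sample distance: {start:.3f}, after search: {end:.3f}")
```

Because the best sample seen so far is tracked separately from the current one, the returned distance can never exceed that of the initial random sample. Recomputing the criterion from scratch for every candidate swap keeps the sketch short; an incremental update of the category counts would be the natural optimization for large datasets.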
