Local representativeness in vector data

The volume of large-scale real-world data is growing rapidly, and with it the need to reduce that data to a representative sample. Such a sample makes it possible to apply a wide variety of analytical methods whose direct application to the original data would be infeasible. Conventional sampling methods produce non-deterministic results while trying to preserve selected characteristics of the input dataset. We present a novel, simple, and deterministic approach with the same goal. It is not sampling in the strict sense but a reduction of vector data that preserves the internal structure of the data (clusters and density) very well. The approach is based on analyzing nearest neighbors: our proposed x-representativeness takes into account the local density of the data and the nearest neighbors of individual data objects. We also present experiments with two different datasets. These experiments aim to show that x-representativeness can be used to deterministically reduce the datasets to samples of representatives of different sizes while maintaining the properties of the original datasets.
