Paper information - Intelligent sampling for big data using bootstrap sampling and chebyshev inequality

Intelligent sampling for big data using bootstrap sampling and chebyshev inequality

The amount of data being generated and stored is growing exponentially, owing in part to continuing advances in computer technology. These data present tremendous opportunities for data mining, a burgeoning field of computer science that focuses on developing methods that extract knowledge from data. In many real-world problems, data mining algorithms have access to massive amounts of data, but mining all the available data is prohibitive because of computational (time and memory) constraints. Much current research is concerned with scaling up data mining algorithms (i.e., improving existing algorithms so that they can handle larger datasets). An alternative approach is to scale down the data. Determining the smallest sufficient training set size that achieves the same accuracy as the entire available dataset therefore remains an important research question. Our research focuses on selecting how many instances to sample and present to the data mining algorithm. The goals of this paper are to study and characterize the properties of learning curves, to integrate them with the Chebyshev bound to obtain an efficient general-purpose adaptive sampling schedule, and to empirically validate our algorithm for scaling down the data.
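The general idea of scaling down the data with a Chebyshev-based stopping rule can be illustrated with a minimal Python sketch. This is not the algorithm from the paper, whose details the abstract does not give. It assumes a much simpler setting in which the quantity of interest is the mean of a numeric population rather than the accuracy of a learner, and the names `chebyshev_sample_size`, `bootstrap_variance_of_mean`, and `adaptive_sample`, the doubling schedule, and all default parameters are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def chebyshev_sample_size(variance, epsilon, delta):
    """Smallest n with variance / (n * epsilon**2) <= delta.

    Follows from Chebyshev's inequality applied to the sample mean:
    P(|mean_n - mu| >= epsilon) <= variance / (n * epsilon**2).
    """
    return math.ceil(variance / (epsilon ** 2 * delta))


def bootstrap_variance_of_mean(sample, n_resamples, rng):
    """Estimate the variance of the sample mean by resampling with replacement."""
    means = [
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]
    return statistics.pvariance(means)


def adaptive_sample(population, epsilon, delta, start=50, n_resamples=200, seed=0):
    """Double the sample size until the Chebyshev bound is satisfied.

    Illustrative assumption: the schedule is geometric (doubling) and the
    stopping test uses a bootstrap estimate of Var(mean).
    """
    rng = random.Random(seed)
    n = start
    while True:
        n = min(n, len(population))
        sample = rng.sample(population, n)
        var_of_mean = bootstrap_variance_of_mean(sample, n_resamples, rng)
        # Chebyshev: P(|mean - mu| >= epsilon) <= Var(mean) / epsilon**2
        if var_of_mean / epsilon ** 2 <= delta or n == len(population):
            return sample
        n *= 2
```

For example, calling `adaptive_sample(population, epsilon=0.2, delta=0.1)` on a list of 20,000 numbers returns a subsample once the bootstrap estimate of the mean's variance is small enough for the bound to hold, or the whole population if it never does. Because Chebyshev's inequality makes no distributional assumptions, the resulting sample sizes tend to be conservative.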

A. Satyanarayana
