Adaptive Sampling Scheme for Learning in Severely Imbalanced Large Scale Data

Imbalanced data poses a serious challenge for many machine learning and data mining applications and can significantly degrade the performance of learning algorithms. In digital marketing applications, events of interest (the positive instances for building predictive models), such as clicks and purchases, are rare. A retail website can easily receive a million visits every day, yet only a small percentage of those visits lead to a purchase. The large volume of raw data and the small percentage of positive instances make it challenging to build decent predictive models in a timely fashion. In this paper, we propose an adaptive sampling strategy to deal with this problem. It efficiently returns high-quality training data, ensures system responsiveness, and improves predictive performance.
