Detecting click fraud in online advertising: a data mining approach

Click fraud, the deliberate clicking on advertisements with no genuine interest in the product or service offered, is one of the most daunting problems in online advertising. Building an effective fraud detection method is thus pivotal for online advertising businesses. We organized the Fraud Detection in Mobile Advertising (FDMA) 2012 Competition, giving participants the opportunity to work on real-world fraud data from BuzzCity Pte. Ltd., a global mobile advertising company based in Singapore. In particular, the task is to identify fraudulent publishers who generate illegitimate clicks, and to distinguish them from normal publishers. The competition was held from September 1 to September 30, 2012, attracting 127 teams from more than 15 countries. The mobile advertising data are unique and complex, involving heterogeneous information, noisy patterns with missing values, and a highly imbalanced class distribution. The competition results provide a comprehensive study of the usability of data mining-based fraud detection approaches in a practical setting. Our principal findings are that features derived from fine-grained time-series analysis are crucial for accurate fraud detection, and that ensemble methods offer promising solutions to highly imbalanced nonlinear classification tasks with mixed variable types and noisy/missing patterns. The competition data remain available for further studies at http://palanteer.sis.smu.edu.sg/fdma2012/.
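To make the first finding concrete, the sketch below derives fine-grained time-series features from a publisher's raw click timestamps. This is an illustrative reconstruction, not the competition's actual feature set: the feature names and the burstiness measure are assumptions chosen to show the kind of signal (bot-like click regularity or burstiness) such features can expose.

```python
# Hedged sketch: per-publisher time-series features from raw click logs.
# Feature names and the burstiness statistic are illustrative assumptions,
# not the FDMA 2012 organizers' or winners' actual feature set.
from statistics import mean, pstdev

def click_features(timestamps):
    """Summarize a publisher's click timestamps (in seconds) as features."""
    ts = sorted(timestamps)
    # Inter-click gaps; a single click yields one zero gap as a fallback.
    gaps = [b - a for a, b in zip(ts, ts[1:])] or [0.0]
    avg, std = mean(gaps), pstdev(gaps)
    return {
        "n_clicks": len(ts),
        "mean_gap": avg,
        "std_gap": std,
        # Burstiness in [-1, 1]: near +1 for bursty traffic, near 0 for
        # Poisson-like arrivals, -1 for perfectly periodic (bot-like) clicks.
        "burstiness": (std - avg) / (std + avg) if (std + avg) else 0.0,
    }
```

A perfectly periodic click stream (equal gaps) gives `burstiness == -1.0`, a pattern rarely produced by genuine human browsing; such per-publisher feature vectors would then feed an ensemble classifier such as a random forest or gradient-boosted trees.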
