A benchmarking study of classification techniques for behavioral data

Previous research has demonstrated the predictive power of large-scale behavioral data, which are increasingly common. Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper contributes in this direction through a benchmarking study: eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons, enriched with learning curve analyses, yield two important findings. First, there is an inherent trade-off between generalization performance and computation time, which makes the choice of an appropriate classifier dependent on computational constraints and data set characteristics. Well-regularized logistic regression achieves the best AUC but takes the longest to train; L2 regularization performs better than sparse L1 regularization. A similarity-based technique achieves an attractive generalization/time trade-off. Second, although the data sets used are large, the learning curves show that, as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding holds in both the instance and the feature dimensions, in contrast with learning curve studies on traditional data. The results provide guidance to researchers and practitioners on selecting appropriate classification techniques, sample sizes, and data features, and they help focus scalable algorithm design for large behavioral data.
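As a point of reference for the AUC metric used in the comparison, the sketch below computes the area under the ROC curve through its rank-sum (Mann-Whitney) formulation. It is only an illustration: the function name `auc` and the toy labels and scores are assumptions made here, not code or data from the study.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    labels: 1 for positive instances, 0 for negative instances.
    scores: classifier outputs, where higher means "more likely positive".
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_block = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_block
        i = j

    # U statistic for the positive class, normalised by the number of
    # positive/negative pairs.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Hypothetical toy example (not data from the study).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
```

Read this way, AUC is the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counted as one half. That makes it insensitive to the class imbalance that is common in behavioral data sets.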
