Imbalanced customer classification for bank direct marketing

This paper contributes insights on data analytics methodologies applied to direct marketing. From a business perspective, the objective is to identify the banking customers who are most likely to respond positively to a term deposit marketing campaign. Mathematically, this is a typical classification problem; in our case, however, the class of interest is relatively rare and the dataset is imbalanced. The paper compares the performance of statistical, distance-based, induction and machine learning classification algorithms in predicting potential depositors when trained on imbalanced datasets. The main effort focuses on effectively rebalancing the datasets during training, so as to counteract the negative effect of imbalance and increase the number of correct classifications for the under-represented class. Distance-based and cluster-based resampling techniques are compared and combined in order to understand how customer targeting could become more effective for practitioners. Using a publicly available dataset for direct marketing of bank products, we study the influence of resampling techniques on the different algorithms and conclude that our proposed cluster-based technique is overall the most effective compared with other well-established techniques.
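To make the idea of distance-based rebalancing concrete, the following is a minimal sketch of SMOTE-style oversampling, one of the well-established techniques in this area. It is not the cluster-based technique proposed in the paper, whose details are not given in this abstract. The function name `smote_like_oversample`, its parameters and the toy data are assumptions introduced purely for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance).

    Illustrative sketch only; the name and parameters are hypothetical.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy data invented for illustration: 8 non-responders, 3 responders.
majority = [(float(i), float(i % 3)) for i in range(8)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]

# Generate enough synthetic responders to match the majority class size.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In a setting like the one described above, such resampling would be applied to the training data only, so that evaluation still reflects the original class distribution.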
