Clustering and Classification Based on Distributed Automatic Feature Engineering for Customer Segmentation

To stay competitive and extract valuable information, decision-makers need in-depth data analytics based on machine learning or data mining. Clustering and classification are two common data mining methods. Clustering divides data into groups according to similarity or shared features. Classification, by contrast, builds a model from given training data and then predicts the target class or label of the test data. In recent years, many researchers have focused on hybrids of clustering and classification. These techniques have achieved notable results, but there is still room to improve performance, for example through distributed processing. In this paper, we therefore propose clustering and classification based on distributed automatic feature engineering (AFE) for customer segmentation. In the proposed algorithm, AFE uses the artificial bee colony (ABC) algorithm to select valuable features of the input data, and the RFM (recency, frequency, monetary) model then provides the basic data analytics. AFE first initializes the number of clusters k. The clustering methods k-means, Ward's method, and fuzzy c-means (FCM) then group the examples into different clusters. Finally, an improved fuzzy decision tree classifies the target data and generates decision rules that explain the situation in detail. AFE also determines the split number of the improved fuzzy decision tree to increase classification accuracy. The proposed approach is distributed and runs on the Apache Spark platform. The paper addresses the problem of combining clustering and classification in machine learning. In the reported results, the classification accuracy of the proposed approach exceeds that of the other approaches compared. We also provide useful strategies and decision rules, derived from the data analytics, for decision-makers.
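As a rough illustration of the RFM and k-means steps named in the abstract, the following minimal sketch derives recency, frequency, and monetary values from a toy transaction list and groups customers with plain Lloyd's k-means. It is not the authors' implementation: the function names `rfm_table` and `kmeans`, the sample data, and the seeding by the first k points are all assumptions made for the example, and ABC feature selection, Ward's method, FCM, the improved fuzzy decision tree, and the Spark distribution are left out entirely.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, today):
    """Return {customer: (recency_days, frequency, monetary)}.

    transactions: iterable of (customer, purchase_date, amount) tuples.
    Hypothetical helper; the paper does not specify its RFM computation.
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in transactions:
        if customer not in last_purchase or day > last_purchase[customer]:
            last_purchase[customer] = day
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((today - last_purchase[c]).days, frequency[c], monetary[c])
        for c in last_purchase
    }


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data, invented for illustration only.
transactions = [
    ("alice", date(2024, 1, 5), 120.0),
    ("alice", date(2024, 3, 1), 80.0),
    ("bob", date(2023, 11, 20), 15.0),
    ("carol", date(2024, 2, 27), 200.0),
    ("carol", date(2024, 3, 3), 150.0),
    ("dave", date(2023, 10, 2), 10.0),
]

rfm = rfm_table(transactions, today=date(2024, 3, 10))
customers = sorted(rfm)
labels, _ = kmeans([rfm[c] for c in customers], k=2)
segments = dict(zip(customers, labels))
```

With this toy data, the recent, frequent, higher-spending customers (alice and carol) end up in one segment and the lapsed, low-spending ones (bob and dave) in the other. The features are left unscaled here, so the monetary value dominates the distance; a real pipeline would normally standardize the three RFM columns first.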
