Cloud deployment of game-theoretic categorical clustering using Apache Spark: An application to car recommendation

Abstract: Personal vehicles are increasingly preferred over public transport. Contactless inspection and analysis of vehicle features, guided by personal preferences, is likely to be in high demand among customers in the post-pandemic world, and a comprehensive online car recommendation system is a natural way for customers to understand and select those features. However, clustering such categorical features is challenging because two textual attributes cannot be compared directly. In this paper, we design a cloud-based system that addresses this issue automatically. Motivated by cooperative game theory and fuzzy techniques, and building on the concept of the Shapley value, we develop a categorical data clustering algorithm. To overcome its main limitation, the O(n^2) time complexity of the Shapley computation, the proposed algorithm is distributed using Apache Spark's MapReduce architecture on Google Cloud Platform. The model is validated on several synthetic and real data sets. Finally, a car recommendation system is proposed and tested on three car sales data sets. The proposed approach outperforms existing categorical clustering approaches in terms of various clustering validity indices. To the best of the authors' knowledge, this is the first attempt to apply MapReduce-based Shapley computation to categorical clustering, and it can find applications beyond the proposed car recommendation system.
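To illustrate why a pairwise computation over n records costs O(n^2) and why it suits a map/reduce decomposition, the following is a minimal sketch in plain Python. It is not the authors' algorithm and does not use the Spark API. It assumes, purely for illustration, that each record's score is the sum of its simple-matching (Hamming-style) similarities to all other records; the names `hamming_similarity`, `map_pair`, `reduce_by_key`, and `pairwise_scores`, as well as the toy car records, are hypothetical.

```python
from functools import reduce
from itertools import combinations


def hamming_similarity(a, b):
    """Fraction of attributes on which two categorical records agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def map_pair(pair):
    """Map step: emit (record index, similarity) for both members of a pair."""
    (i, a), (j, b) = pair
    s = hamming_similarity(a, b)
    return [(i, s), (j, s)]


def reduce_by_key(acc, kv):
    """Reduce step: sum the emitted similarities per record index."""
    key, value = kv
    acc[key] = acc.get(key, 0.0) + value
    return acc


def pairwise_scores(records):
    """Visit all n*(n-1)/2 unordered pairs, hence the quadratic cost."""
    pairs = combinations(enumerate(records), 2)
    emitted = (kv for pair in pairs for kv in map_pair(pair))
    return reduce(reduce_by_key, emitted, {})


# Hypothetical toy data: (fuel, transmission, body type).
cars = [
    ("petrol", "manual", "hatchback"),
    ("petrol", "manual", "sedan"),
    ("diesel", "automatic", "suv"),
]
scores = pairwise_scores(cars)
```

Each pair is processed independently in the map step, so in a distributed setting the pairs could be partitioned across workers and the per-record sums combined afterwards. How the paper actually partitions the Shapley computation is not stated in the abstract.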
