Distributed AdaBoost Extensions for Cost-sensitive Classification Problems

In data mining, classification has long been an area of interest, all the more so after the rapid growth in the volume of collected data. Cost-sensitive classification is a subset of the broader classification problem in which the focus is on addressing the class imbalance problem. This paper addresses the class imbalance problem using Cost-sensitive Distributed Boosting (CsDb), a meta-classifier based on the MapReduce paradigm and designed to solve the class imbalance problem for big data. The focus of this work is on data whose size is beyond the capacity of standalone commodity hardware, so CsDb learns its models in a distributed environment. Empirical evaluation of CsDb over datasets from different application domains shows an average reduction in misclassification cost and in the number of high-cost errors of 21.06% and 30.15%, respectively, with respect to its error-based predecessors, while preserving the cost-sensitivity of its cost-based predecessor. It maintains accuracy and F1-score, and reduces model-building time by 90.14% compared to a non-distributed cost-sensitive classifier.
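
The abstract does not spell out CsDb's training procedure, so the following Python is only a rough sketch of the general idea it describes: weak learners are fitted on data partitions in a map phase, and a reduce phase combines their votes and reweights examples using per-example misclassification costs, in the spirit of cost-sensitive AdaBoost variants such as CSB/AdaCost. The function names (`map_fit`, `reduce_update`) and the specific reweighting rule are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only -- NOT the CsDb algorithm from the paper.
# One boosting round: fit weak learners per partition (map), then
# combine them and apply a cost-weighted reweighting (reduce).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def map_fit(partition, sample_weights):
    """Map step (hypothetical): fit a decision stump on one partition."""
    X, y = partition
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=sample_weights)
    return stump

def reduce_update(stumps, X, y, w, cost):
    """Reduce step (hypothetical): combine per-partition stumps by
    majority vote, then reweight examples; cost[i] scales the weight
    increase of example i, so costly errors gain weight faster than
    in plain AdaBoost (a CSB/AdaCost-style update)."""
    votes = np.sign(np.sum([s.predict(X) for s in stumps], axis=0))
    votes[votes == 0] = 1                       # break ties toward +1
    err = np.sum(w * (votes != y)) / np.sum(w)  # weighted training error
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)       # learner weight
    w = w * np.exp(alpha * cost * (votes != y)) # cost-sensitive reweighting
    return alpha, w / w.sum()

# Toy usage on an imbalanced synthetic dataset with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.where(rng.random(1000) < 0.1, 1, -1)     # ~10% minority class
cost = np.where(y == 1, 5.0, 1.0)               # minority errors cost 5x
w = np.full(len(y), 1.0 / len(y))

for _ in range(10):                             # boosting rounds
    parts = np.array_split(np.arange(len(y)), 4)  # 4 "mapper" partitions
    stumps = [map_fit((X[idx], y[idx]), w[idx]) for idx in parts]
    alpha, w = reduce_update(stumps, X, y, w, cost)
```

Under this kind of scheme, only the per-partition fitting is distributed; the reweighting needs a global pass, which is why MapReduce-style boosting typically alternates local training with a global reduce per round.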
