ImbTreeAUC: An R package for building classification trees using the area under the ROC curve (AUC) on imbalanced datasets

Abstract: In this paper, we propose ImbTreeAUC, a novel R package for building binary and multiclass decision trees using the area under the receiver operating characteristic (ROC) curve (AUC). The package provides nonstandard measures, namely local, semiglobal and global AUC measures, for selecting both the optimal split point of an attribute and the optimal attribute to split on. ImbTreeAUC can also handle imbalanced data, which is a challenging issue in many practical applications. It supports cost-sensitive learning through a user-defined misclassification cost matrix, as well as weight-sensitive learning. It accepts all types of attributes: continuous, ordered and nominal. The package and its code are freely available.
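
To make the idea of AUC-driven split selection concrete, the following is a minimal sketch in Python rather than R, and it does not use or reproduce the ImbTreeAUC API. It assumes a binary problem with labels 0 and 1 and a single continuous attribute, and it treats the AUC of a split as the AUC obtained when every observation is scored by the positive-class fraction of the leaf it falls into. That reading is an assumption made for illustration only. The function names `auc`, `split_auc` and `best_split` are hypothetical.

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def split_auc(values, labels, threshold):
    """AUC of the two-leaf tree that splits at `values <= threshold`.

    Assumption for this sketch: each observation is scored by the
    positive-class fraction of its leaf.
    """
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.5  # a split that separates nothing is no better than chance
    left_score = sum(left) / len(left)
    right_score = sum(right) / len(right)
    scores = [left_score if v <= threshold else right_score for v in values]
    return auc(scores, labels)


def best_split(values, labels):
    """Return (threshold, auc) for the midpoint threshold with the highest AUC."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct attribute values")
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(
        ((t, split_auc(values, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    # Toy imbalanced data: 2 positives out of 10 observations.
    values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
    print(best_split(values, labels))
```

On this toy data the sketch selects the threshold 7.5, which isolates both minority-class observations in one leaf even though that leaf is not pure. Because AUC is computed over positive-negative pairs rather than over raw counts, the criterion does not favour the majority class by construction.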
