An Impurity Measure for Selecting Decision Tree Nodes that Tolerates Noisy Cases

In recent years, recommending an appropriate splitting attribute for a binary decision tree under unusual circumstances, such as training or testing with noisy attributes, has become an increasingly challenging research problem, since most traditional impurity measures have never been evaluated for how well they tolerate such noise. This paper therefore studies and proposes an impurity measure that can accurately evaluate the goodness of a binary decision tree node split under noisy conditions. To verify that the classification accuracy of a decision tree built with the proposed measure is preserved, an experiment comparing it against traditional impurity measures was conducted; the results show that the accuracy of the proposed measure when classifying under noisy cases is acceptable.
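The paper's proposed measure is not specified in this abstract, but the traditional impurity measures it is benchmarked against can be sketched. The following is a minimal illustration (not the authors' method) of the two standard baselines, Gini impurity and entropy, and the weighted impurity of a binary split that a tree learner minimizes when choosing a splitting attribute; all function names are illustrative.

```python
import math

def gini(counts):
    """Gini impurity of a node, given per-class example counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (base 2) of a node, given per-class example counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_impurity(left, right, measure=gini):
    """Weighted impurity of a binary split: (n_L/n)*I(L) + (n_R/n)*I(R)."""
    n_l, n_r = sum(left), sum(right)
    n = n_l + n_r
    return (n_l / n) * measure(left) + (n_r / n) * measure(right)

# A perfectly pure split has zero impurity; flipping one label in each
# child (simulated label noise) raises the score, which is why noise can
# change which attribute a traditional measure ranks as the best split.
clean = split_impurity([10, 0], [0, 10])   # 0.0
noisy = split_impurity([9, 1], [1, 9])     # 0.18 under Gini
```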
