Statistical Discretization of Continuous Attributes Using Kolmogorov-Smirnov Test

Unsupervised discretization methods apply simple rules to continuous attributes, and their low time complexity depends mostly on the sorting procedure. Supervised discretization algorithms, in contrast, take class labels into account to achieve higher accuracy. Supervised discretization of continuous features faces two significant challenges. First, noisy class labels reduce the effectiveness of discretization. Second, on large-scale datasets the computational cost of supervised algorithms is high, so the time complexity is dominated by the discretization stage rather than by the sorting procedure. To address these challenges, we devise a statistical unsupervised method named SUFDA. SUFDA aims to produce discrete intervals by decreasing the differential entropy of the normal distribution, with low time complexity and high accuracy. The results show that our unsupervised method is more effective than other discretization baselines on large-scale datasets.
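
As an illustration of the quantity mentioned above, the differential entropy of a normal distribution with standard deviation sigma is 0.5 * ln(2 * pi * e * sigma^2), so it falls as the spread of the values within an interval falls. The sketch below is not the SUFDA procedure, which the abstract does not specify. It only shows, under assumed toy data and a hypothetical cut point, that splitting a range into tighter intervals lowers the entropy of the normal distribution fitted to each interval. The function names, the sample values, and the cut point are all assumptions made for illustration.

```python
import math
import statistics


def normal_differential_entropy(sigma: float) -> float:
    """Differential entropy, in nats, of a normal distribution with std sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def fitted_entropy(values: list[float]) -> float:
    """Entropy of the normal distribution fitted to values (population std)."""
    return normal_differential_entropy(statistics.pstdev(values))


# Toy attribute values (assumed for illustration, not from the paper).
values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
cut = 3.0  # hypothetical cut point, chosen by hand

left = [v for v in values if v <= cut]
right = [v for v in values if v > cut]

whole_entropy = fitted_entropy(values)
left_entropy = fitted_entropy(left)
right_entropy = fitted_entropy(right)

print(f"whole range: {whole_entropy:.3f} nats")
print(f"left interval: {left_entropy:.3f} nats")
print(f"right interval: {right_entropy:.3f} nats")
```

In this toy case both intervals have lower fitted entropy than the undivided range, which is the direction of change the abstract describes. How SUFDA selects cut points and when it stops splitting are not covered by this sketch.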