A Linear Time Partitioning Algorithm for Frequency Weighted Impurity Functions

Partitioning algorithms play a key role in machine learning, signal processing, and communications. They underlie many well-known NP-hard problems such as k-means clustering and vector quantization. The quality of a partition scheme is measured by a given impurity function evaluated over the resulting partitions, and an optimal partition is one that attains the minimum impurity. Practical algorithms for finding an optimal partition are approximate and heuristic, and often assume certain properties of the given impurity function, such as concavity or convexity. In this paper, we propose an efficient, linear-time heuristic algorithm for minimizing the impurity over a broader class of impurity functions, which includes popular impurities such as the Gini index and entropy. We also connect our approach to a well-known result stating that the optimal partitions correspond to regions separated by hyperplane cuts in the probability space of the posterior distribution.
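
To make the objective concrete, the sketch below (an illustration of the impurity objective, not the paper's algorithm) evaluates the frequency-weighted impurity of a candidate partition under the Gini index and entropy impurities mentioned in the abstract. The function names, data layout, and example counts are assumptions introduced here for illustration.

```python
# Minimal sketch: frequency-weighted impurity of a candidate partition.
# Each element carries a vector of class frequencies; a bin's impurity is
# weighted by the bin's total frequency. Names and layout are illustrative.
import numpy as np

def gini(p):
    """Gini index of a probability vector p."""
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (in bits) of a probability vector p, with 0*log(0) = 0."""
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def weighted_impurity(counts, partition, impurity=gini):
    """
    counts    : (N, C) array; counts[i, c] is the frequency of class c for element i.
    partition : iterable of index lists, one list per bin.
    Returns the sum over bins of (bin weight) * impurity(bin class distribution),
    normalized by the total frequency.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    score = 0.0
    for bin_indices in partition:
        bin_counts = counts[list(bin_indices)].sum(axis=0)  # aggregate class frequencies
        w = bin_counts.sum()
        if w > 0:
            score += w * impurity(bin_counts / w)
    return score / total

# Example: four elements, two classes, split into two bins.
counts = [[8, 2], [7, 3], [1, 9], [2, 8]]
print(weighted_impurity(counts, [[0, 1], [2, 3]], impurity=gini))
print(weighted_impurity(counts, [[0, 1], [2, 3]], impurity=entropy))
```

An optimal partition, in the sense used above, is one that minimizes this weighted impurity over all admissible groupings of the elements into a fixed number of bins.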
