Extreme Multi-Class Classification

We consider the multi-class classification problem in the setting where the number of labels is very large, so that train and test running times logarithmic in the number of labels are highly desirable. Additionally, in our setting the labels are feature dependent. We propose a reduction of this problem to a set of binary regression problems organized in a tree structure, and we introduce a simple top-down criterion for purification of labels that admits gradient-descent-style optimization. Furthermore, we prove that maximizing the proposed objective function (the splitting criterion) leads simultaneously to pure and balanced splits. We use the entropy of the tree leaves, a standard measure used in decision trees, to measure the quality of the obtained tree, and we show an upper bound on the number of splits required to reduce this measure below a given threshold. Finally, we show empirically that the splits recovered by our algorithm lead to significantly smaller error than random splits.
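The tree-of-binary-regressors reduction described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the splitting criterion here is a hypothetical stand-in (labels are ordered by their mean first feature and the ordering is cut in half, which keeps the tree balanced by construction, and training points are routed by true label membership, which keeps nodes pure), but it shows the key structural property that prediction costs one binary decision per level, i.e. a number of decisions logarithmic in the number of labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Node:
    def __init__(self, labels):
        self.labels = labels  # class labels reachable below this node
        self.w = None         # binary regressor weights (internal nodes only)
        self.left = None
        self.right = None

def fit_regressor(X, t, lr=0.5, epochs=300):
    # Logistic regression trained by batch gradient descent;
    # predicts P(route left | x) at an internal node.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * X.T @ (sigmoid(X @ w) - t) / len(t)
    return w

def build_tree(X, y, labels):
    node = Node(labels)
    if len(labels) == 1:
        return node  # leaf: a single label remains
    # Hypothetical split heuristic (NOT the paper's criterion): order labels
    # by their mean first feature and cut the ordering in half, which keeps
    # the tree balanced by construction.
    order = sorted(labels, key=lambda c: X[y == c, 0].mean())
    left = set(order[: len(order) // 2])
    t = np.isin(y, sorted(left)).astype(float)
    node.w = fit_regressor(X, t)
    mask = t == 1.0  # route training points by true label membership
    node.left = build_tree(X[mask], y[mask], left)
    node.right = build_tree(X[~mask], y[~mask], set(order[len(order) // 2:]))
    return node

def predict(node, x):
    depth = 0
    while node.w is not None:  # one binary decision per level
        node = node.left if sigmoid(x @ node.w) > 0.5 else node.right
        depth += 1
    return next(iter(node.labels)), depth

# Toy data: 4 well-separated 1-D Gaussian classes plus a bias feature.
rng = np.random.default_rng(0)
means = [-3.0, -1.0, 1.0, 3.0]
X = np.vstack([np.column_stack([rng.normal(m, 0.3, 100), np.ones(100)])
               for m in means])
y = np.repeat(np.arange(4), 100)
tree = build_tree(X, y, set(range(4)))
preds, depths = zip(*(predict(tree, x) for x in X))
acc = float(np.mean(np.array(preds) == y))
print(acc, max(depths))  # every test point uses ceil(log2(4)) = 2 decisions
```

With 1000 labels the same structure would need only about 10 binary decisions per prediction, which is the logarithmic train/test cost the abstract refers to.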