Kaggle LSHTC4 Winning Solution

Our winning submission to the 2014 Kaggle competition for Large Scale Hierarchical Text Classification (LSHTC) consists mostly of an ensemble of sparse generative models extending Multinomial Naive Bayes. The base classifiers consist of hierarchically smoothed models combining document, label, and hierarchy-level Multinomials, with feature pre-processing using variants of TF-IDF and BM25. Additional diversification is introduced by different types of folds and random search optimization for different measures. The ensemble algorithm optimizes macro F-score by predicting the documents for each label, instead of the usual prediction of labels per document. Scores for documents are predicted by weighted voting of base-classifier outputs with a variant of Feature-Weighted Linear Stacking. The number of documents per label is chosen using label priors and thresholding of vote scores.

This document describes the models and software used to build our solution. Reproducing the results for our solution can be done by running the scripts included in the Kaggle package.
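To illustrate the kind of feature pre-processing mentioned above, here is a minimal sketch of standard BM25 term weighting. This is a generic textbook version, not the competition code: the constants k1 and b are the usual defaults, and the winning solution used its own variants of TF-IDF and BM25.

```python
import math

def bm25_weights(docs, k1=1.2, b=0.75):
    """Compute BM25 term weights for each tokenized document in a small corpus.

    Generic BM25 sketch with default constants; the actual solution
    used modified variants of TF-IDF and BM25.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency of each term across the corpus.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        # Raw term frequencies within this document.
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        w = {}
        for term, f in tf.items():
            # Smoothed inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Saturating term-frequency component with length normalization.
            w[term] = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))
        weighted.append(w)
    return weighted

docs = [["large", "scale", "text"],
        ["text", "classification"],
        ["hierarchical", "text"]]
weights = bm25_weights(docs)
```

Rare terms such as "classification" receive a higher weight than corpus-wide terms such as "text", which is the behavior the generative base classifiers benefit from when the weighted counts replace raw term frequencies.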