Our winning submission to the 2014 Kaggle competition for Large Scale Hierarchical Text Classification (LSHTC) consists mostly of an ensemble of sparse generative models extending Multinomial Naive Bayes. The base classifiers are hierarchically smoothed models combining document, label, and hierarchy-level Multinomials, with feature pre-processing using variants of TF-IDF and BM25. Additional diversification is introduced by different types of folds and by random search optimization for different measures. The ensemble algorithm optimizes macro F-score by predicting the documents for each label, instead of the usual prediction of labels per document. Scores for documents are predicted by weighted voting of base-classifier outputs with a variant of Feature-Weighted Linear Stacking. The number of documents per label is chosen using label priors and thresholding of vote scores. This document describes the models and software used to build our solution. Reproducing the results for our solution can be done by running the scripts included in the Kaggle package.
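To make the two central ideas concrete, the Python sketch below illustrates (a) a hierarchically smoothed Multinomial Naive Bayes score, in which a label's Multinomial is interpolated with its parent-category and corpus-level distributions, and (b) the label-centric decision rule, which ranks documents per label and keeps a prior-scaled number of candidates above a vote-score threshold. This is a minimal sketch under stated assumptions, not the paper's actual implementation; all names and parameter values (`smoothed_log_likelihood`, `select_docs_for_label`, `alpha`, `beta`, the threshold) are illustrative.

```python
import numpy as np

def smoothed_log_likelihood(doc_counts, label_counts, parent_counts,
                            corpus_counts, alpha=0.5, beta=0.3):
    """Log-likelihood of a document under a hierarchically smoothed Multinomial.

    The label distribution is interpolated with its parent-category and
    corpus-level distributions. The interpolation weights alpha and beta are
    illustrative assumptions, not the paper's tuned values. Assumes every
    vocabulary term occurs at least once in corpus_counts, so the smoothed
    probabilities are strictly positive.
    """
    def mle(counts):
        # Maximum-likelihood Multinomial estimate from raw term counts.
        return counts / counts.sum()

    p = (alpha * mle(label_counts)
         + beta * mle(parent_counts)
         + (1.0 - alpha - beta) * mle(corpus_counts))
    return float(doc_counts @ np.log(p))

def select_docs_for_label(vote_scores, label_prior, n_docs, threshold=0.0):
    """Label-centric prediction: choose documents for a label rather than
    labels for a document. The expected number of positives comes from the
    label prior; candidates below the vote-score threshold are dropped."""
    k = max(1, int(round(label_prior * n_docs)))
    ranked = np.argsort(vote_scores)[::-1][:k]  # top-k documents by vote score
    return [int(d) for d in ranked if vote_scores[d] > threshold]

# Toy usage over a 4-term vocabulary (purely illustrative numbers).
doc = np.array([2, 0, 1, 0])
label = np.array([10, 1, 5, 2])
parent = np.array([30, 8, 12, 10])
corpus = np.array([100, 50, 60, 40])
print(smoothed_log_likelihood(doc, label, parent, corpus))
print(select_docs_for_label(np.array([0.9, 0.1, 0.7, -0.2]),
                            label_prior=0.5, n_docs=4))
```

In the actual system, the vote scores would come from the Feature-Weighted Linear Stacking variant applied to the base-classifier outputs, and the interpolation weights and threshold would be set by the random search optimization mentioned above.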