Logistic Model Trees with AUC Split Criterion for the KDD Cup 2009 Small Challenge

In this work, we describe our approach to the "Small Challenge" of the KDD cup 2009, a classification task with incomplete data. Preprocessing, feature extraction and model selection are documented in detail. We suggest a criterion based on the number of missing values to select a suitable imputation method for each feature. Logistic Model Trees (LMT) are extended with a split criterion optimizing the Area under the ROC Curve (AUC), which was the requested evaluation criterion. By stacking boosted decision stumps and LMT we achieved the best result for the "Small Challenge" without making use of additional data from other feature sets, resulting in an AUC score of 0.8081. We also present results of an AUC optimizing model combination that scored only slightly worse with an AUC score of 0.8074.

[1]  Eibe Frank,et al.  Logistic Model Trees , 2003, ECML.

[2]  Yoav Freund,et al.  A Short Introduction to Boosting , 1999 .

[3]  John A. Nelder,et al.  A Simplex Method for Function Minimization , 1965, Comput. J..

[4]  Rich Caruana,et al.  Obtaining Calibrated Probabilities from Boosting , 2005, UAI.

[5]  Y. Freund,et al.  Discussion of the Paper \additive Logistic Regression: a Statistical View of Boosting" By , 2000 .

[6]  Peter A. Flach,et al.  Improving the AUC of Probabilistic Estimation Trees , 2003, ECML.

[7]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[8]  Padhraic Smyth,et al.  Linearly Combining Density Estimators via Stacking , 1999, Machine Learning.

[9]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[10]  Peter A. Flach,et al.  Learning Decision Trees Using the Area Under the ROC Curve , 2002, ICML.

[11]  Ingunn Myrtveit,et al.  Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods , 2001, IEEE Trans. Software Eng..

[12]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[13]  David G. Stork,et al.  Pattern classification, 2nd Edition , 2000 .

[14]  Ian T. Nabney,et al.  Netlab: Algorithms for Pattern Recognition , 2002 .

[15]  William H. Press,et al.  Numerical recipes in C , 2002 .

[16]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[17]  Thorsten Joachims,et al.  A support vector method for multivariate performance measures , 2005, ICML.