The use of data analysis competitions to select the most appropriate model for a problem is a recent innovation in the field of predictive machine learning. Two of the most well-known examples of this trend were the Netflix Prize and, more recently, the competitions hosted on the online platform Kaggle.
In this paper, we state and attempt to verify a set of qualitative hypotheses about predictive modelling, both in general and in the context of data analysis competitions. To verify our hypotheses, we examine previous competitions and their outcomes, conduct qualitative interviews with top performers on Kaggle, and draw on prior personal experience from competing in Kaggle competitions.
The stated hypotheses, covering feature engineering, ensembling, overfitting, model complexity and evaluation metrics, provide indications and guidelines for selecting a model that performs well in a Kaggle competition.