Advances in Intelligent Data Analysis XI

Over-fitting is a ubiquitous problem in machine learning, and a variety of techniques to avoid over-fitting the training sample have proven highly effective, including early stopping, regularization, and ensemble methods. However, while over-fitting in training is widely appreciated and its avoidance is now a standard element of best practice, over-fitting can also occur in model selection. This form of over-fitting can significantly degrade generalization performance, but it has thus far received little attention. For example, the kernel and regularization parameters of a support vector machine are often tuned by optimizing a cross-validation based model selection criterion. However, the cross-validation estimate of generalization performance inevitably has a finite variance, so its minimizer depends on the particular sample on which it is evaluated and will generally differ from the minimizer of the true generalization error. Therefore, if the cross-validation error is aggressively minimized, generalization performance may be substantially degraded. In general, the smaller the amount of data available, the higher the variance of the model selection criterion, and hence the more likely it is that over-fitting in model selection will be a significant problem. Similarly, the more hyper-parameters there are to be tuned in model selection, the more easily the variance of the model selection criterion can be exploited, which again increases the likelihood of over-fitting in model selection. Over-fitting in model selection is empirically demonstrated to pose a substantial pitfall in the application of kernel learning methods and Gaussian process classifiers. Furthermore, the evaluation of machine learning methods can easily be significantly biased unless the evaluation protocol properly accounts for this type of over-fitting. Fortunately, the common solutions for avoiding over-fitting in training also appear to be effective in avoiding over-fitting in model selection.
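The core mechanism can be illustrated with a minimal, signal-free sketch (an assumption of this illustration, not an experiment from the paper): the labels are pure noise, every candidate "model" simply guesses at random, so the true accuracy of every candidate is exactly 0.5. Aggressively maximizing the validation accuracy over many candidates nevertheless reports a score well above 0.5, purely by exploiting the variance of the selection criterion.

```python
import random

random.seed(0)
n_val = 100       # size of the validation sample
n_models = 200    # number of hyper-parameter settings tried

# Labels carry no signal: every model's true accuracy is exactly 0.5.
y = [random.randint(0, 1) for _ in range(n_val)]

def predictions(seed, n):
    """A 'model' that guesses at random; the seed plays the role of a hyper-parameter."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]

def accuracy(pred, y):
    return sum(p == t for p, t in zip(pred, y)) / len(y)

# Aggressively maximize the validation accuracy over all candidate settings.
best_acc = max(accuracy(predictions(s, n_val), y) for s in range(n_models))

print(f"best validation accuracy: {best_acc:.2f} (true accuracy of every model: 0.50)")
```

The selected model's validation score is an optimistically biased estimate of its generalization performance, which is exactly why performance must be reported on data that played no part in the selection.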
Three examples are presented, based on regularization of the model selection criterion, early stopping in model selection, and minimizing the number of hyper-parameters to be tuned during model selection.
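The third remedy can be sketched with the same signal-free setup as above (random-guess "models" standing in for hyper-parameter settings; an assumption of this illustration, not the paper's experiments): with fewer candidate settings there is less opportunity to exploit the variance of the selection criterion, so the optimistic bias of the selected model's validation score shrinks.

```python
import random

random.seed(2)
n_val = 100

# Signal-free labels: the true accuracy of every candidate model is 0.5.
y = [random.randint(0, 1) for _ in range(n_val)]

def predictions(seed, n):
    """A 'model' that guesses at random; the seed plays the role of a hyper-parameter."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]

def accuracy(pred, y):
    return sum(p == t for p, t in zip(pred, y)) / len(y)

def best_validation_accuracy(n_candidates):
    """Model selection: keep the candidate with the highest validation accuracy."""
    return max(accuracy(predictions(s, n_val), y) for s in range(n_candidates))

small_search = best_validation_accuracy(5)    # few hyper-parameter settings tried
large_search = best_validation_accuracy(200)  # many hyper-parameter settings tried

print(f"optimistic bias with 5 candidates:   {small_search - 0.5:+.2f}")
print(f"optimistic bias with 200 candidates: {large_search - 0.5:+.2f}")
```

The larger search can never report a lower validation score than the smaller one (it maximizes over a superset of candidates), so widening the search space can only inflate the apparent performance of the selected model.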
