Model and Algorithm Selection in Statistical Learning and Optimization

Modern data-driven statistical techniques, e.g., non-linear classification and regression methods from machine learning, play an increasingly important role in applied data analysis and quantitative research. For real-world applications we usually do not know a priori which methods will work best. Furthermore, most of the available models depend on so-called hyper- or control parameters, which can drastically influence their performance. This leads to a vast space of potential models, which cannot be explored exhaustively. Modern optimization techniques, often either evolutionary or model-based, are employed to speed up this search.

A very similar problem occurs in continuous and discrete optimization and, in general, in many other areas where problem instances are solved by algorithmic approaches: many competing techniques exist, some of them heavily parametrized, and again little is known about how to make the correct choice for a given application. These general problems are called algorithm selection and algorithm configuration. Instead of relying on tedious, manual trial and error, one should rather employ the available computational power in a methodical fashion to obtain an appropriate algorithmic choice, supporting this process with machine learning techniques to discover and exploit as much of the search space structure as possible.

In this cumulative dissertation I summarize nine papers that deal with the problem of model and algorithm selection in the areas of machine learning and optimization. Issues in benchmarking, resampling, efficient model tuning, feature selection and automatic algorithm selection are addressed and solved using modern techniques. I apply these methods to tasks from engineering, music data analysis and black-box optimization. The dissertation concludes by summarizing my published R packages for such tasks and specifically discusses two packages, BatchJobs and BatchExperiments, for parallelization on high-performance computing clusters and for parallel statistical experiments.
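
To make the model selection problem concrete, the following minimal R sketch chooses a single hyperparameter, the neighborhood size k of a k-nearest-neighbor classifier, by cross-validated misclassification error. The data set, candidate grid and fold count are illustrative choices, not taken from the dissertation.

    # Illustrative sketch: tune k of a k-NN classifier by 10-fold
    # cross-validation; data, grid and fold count are arbitrary choices.
    library(class)  # recommended package shipped with R; provides knn()

    set.seed(1)
    X <- as.matrix(iris[, 1:4])
    y <- iris$Species

    cv_error <- function(k, folds = 10L) {
      fold <- sample(rep(seq_len(folds), length.out = nrow(X)))
      errs <- vapply(seq_len(folds), function(f) {
        test <- fold == f
        pred <- knn(X[!test, , drop = FALSE], X[test, , drop = FALSE],
                    y[!test], k = k)
        mean(pred != y[test])        # misclassification rate on held-out fold
      }, numeric(1))
      mean(errs)
    }

    ks <- 1:25
    cv <- vapply(ks, cv_error, numeric(1))
    best_k <- ks[which.min(cv)]      # hyperparameter with lowest CV error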
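
Model-based optimization replaces such exhaustive grids with a surrogate-guided search, in the spirit of the efficient global optimization and sequential model-based configuration work the dissertation builds on. The sketch below is deliberately simplified: a quadratic lm() surrogate and a toy one-dimensional objective stand in for the Gaussian process models and expensive black-box problems used in practice.

    # Simplified sequential model-based optimization in base R: fit a cheap
    # surrogate to the evaluations seen so far, then spend the next expensive
    # evaluation where expected improvement (EI) is highest.
    set.seed(1)
    f  <- function(x) sin(3 * x) + 0.5 * x^2   # the "expensive" black box (toy)
    xs <- runif(5, -2, 2)                      # small initial design
    ys <- f(xs)

    for (iter in 1:15) {
      fit  <- lm(y ~ poly(x, 2), data = data.frame(x = xs, y = ys))
      cand <- seq(-2, 2, length.out = 200)     # candidate points
      p    <- predict(fit, data.frame(x = cand), se.fit = TRUE)
      z    <- (min(ys) - p$fit) / p$se.fit     # standardized improvement
      ei   <- p$se.fit * (z * pnorm(z) + dnorm(z))  # EI for minimization
      xnew <- cand[which.max(ei)]              # most promising next point
      xs <- c(xs, xnew); ys <- c(ys, f(xnew))
    }
    c(best_x = xs[which.min(ys)], best_y = min(ys))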
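
Finally, tuning runs like the one above parallelize naturally, one job per candidate configuration, across an HPC cluster. The sketch below uses the BatchJobs package named in the dissertation; the call signatures follow the package interface as I recall it, so treat this as an outline and verify against the package documentation.

    # Hedged sketch: distribute the k-NN tuning grid over a cluster.
    library(BatchJobs)

    reg <- makeRegistry(id = "knn_tuning",       # job database on shared storage
                        packages = "class")      # loaded on every worker
    cv_error_job <- function(k, X, y) {          # self-contained job function
      fold <- sample(rep(1:10, length.out = nrow(X)))
      mean(vapply(1:10, function(f) {
        test <- fold == f
        mean(knn(X[!test, ], X[test, ], y[!test], k = k) != y[test])
      }, numeric(1)))
    }
    batchMap(reg, cv_error_job, 1:25,            # one job per candidate k
             more.args = list(X = as.matrix(iris[, 1:4]), y = iris$Species))
    submitJobs(reg)       # hand the jobs to the configured batch scheduler
    waitForJobs(reg)
    cv <- unlist(reduceResultsList(reg))         # gather per-k CV errors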
