Optimal SVM parameter selection for non-separable and unbalanced datasets

This article studies three validation metrics used to select the optimal parameters of a support vector machine (SVM) classifier for non-separable and unbalanced datasets, a situation often encountered when data are obtained experimentally or clinically. The three metrics are the area under the ROC curve (AUC), accuracy, and balanced accuracy. They are tested using computational data only, which makes it possible to create fully separable datasets. Non-separable datasets, representative of a real-world problem, are then created by projecting the separable data onto a lower-dimensional subspace. Knowledge of the separable dataset, which is unavailable in real-world problems, provides a reference against which the three validation metrics are compared using a quantity referred to as the “weighted likelihood”. As an application example, the study investigates a classification model for hip fracture prediction, with data obtained from a parameterized finite element model of a femur. The performance of the validation metrics is studied for several levels of separability, ratios of unbalance, and training set sizes.
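To make the three validation metrics concrete, the following is a minimal sketch in plain Python using their standard textbook definitions. It is not the article's implementation: the function names, the ±1 label convention, the zero decision threshold, and the toy data are all assumptions chosen for illustration. AUC is computed here through its pairwise-ranking interpretation, the fraction of positive/negative pairs in which the positive sample receives the higher score.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    tpr = sum(1 for p in pos if p == 1) / len(pos)
    tnr = sum(1 for p in neg if p == -1) / len(neg)
    return 0.5 * (tpr + tnr)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical unbalanced toy set: 2 positive and 8 negative samples.
y_true = [1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
# Hypothetical SVM decision values; the predicted class is the sign.
scores = [0.4, -0.3, 0.1, -0.2, -0.5, -0.6, -0.7, -0.8, -0.9, -1.0]
y_pred = [1 if s > 0 else -1 for s in scores]

# A trivial classifier that always predicts the majority (negative) class.
y_majority = [-1] * len(y_true)

if __name__ == "__main__":
    print("accuracy          :", accuracy(y_true, y_pred))
    print("balanced accuracy :", balanced_accuracy(y_true, y_pred))
    print("AUC               :", auc(y_true, scores))
    print("majority accuracy :", accuracy(y_true, y_majority))
    print("majority balanced :", balanced_accuracy(y_true, y_majority))
```

On this toy set, the classifier that always predicts the majority class reaches an accuracy of 0.8 but a balanced accuracy of only 0.5, which illustrates why the choice of metric matters when classes are unbalanced.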
