SVM–CART for disease classification

ABSTRACT Classification and regression trees (CART) and support vector machines (SVM) have become popular statistical learning tools for analyzing the complex data that often arise in biomedical research. Although both CART and SVM are powerful classifiers in many clinical settings, there are common scenarios in which each fails to deliver the performance and interpretability required of a clinical decision-making tool. In this paper, we propose a new classification method, SVM–CART, that combines features of SVM and CART to produce a more flexible classifier with the potential to outperform either method in interpretability and prediction accuracy. To further enhance prediction accuracy, we extend a single SVM–CART to an ensemble and provide methods for extracting a representative classifier from that ensemble. The goal is a decision-making tool that can be used in the clinical setting while still harnessing the stability and predictive improvements gained from the SVM–CART ensemble. We conduct an extensive simulation study to assess the performance of the methods in various settings and illustrate them using a clinical neuropathy dataset.
