Regression Phalanxes

Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for regression settings. We define a "Regression Phalanx" as a subset of features that work well together for prediction. We propose a novel algorithm that automatically selects Regression Phalanxes from high-dimensional data sets using hierarchical clustering and builds a prediction model for each phalanx; these models are then combined into an ensemble. Through extensive simulation studies and several real-life applications (including drug discovery, chemical analysis of spectral data, microarray analysis, and climate projections), we show that an ensemble of Regression Phalanxes improves prediction accuracy when combined with effective prediction methods such as the Lasso or Random Forests.
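The overall idea (group features by hierarchical clustering, fit one model per group, then ensemble) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the clustering criterion or the ensembling rule, so the sketch assumes a correlation-based distance, average linkage, a ridge regression base learner (standing in for the Lasso or Random Forests), and simple averaging of predictions. All function names and the simulated data are hypothetical.

```python
import numpy as np

def cluster_features(X, n_groups):
    """Average-linkage agglomerative clustering of the columns of X.
    Assumed distance: 1 - |correlation| (not the paper's criterion)."""
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    groups = [[j] for j in range(X.shape[1])]
    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Average pairwise distance between the two candidate groups
                d = dist[np.ix_(groups[a], groups[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups.pop(b)  # merge the closest pair
    return groups

def fit_ridge(X, y, lam=1.0):
    """Ridge regression with an unpenalized intercept (assumed base learner)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]),
                           Xc.T @ (y - y_mean))
    return x_mean, y_mean, beta

def predict_ridge(model, X):
    x_mean, y_mean, beta = model
    return y_mean + (X - x_mean) @ beta

def fit_phalanx_ensemble(X, y, n_groups, lam=1.0):
    """Form feature groups, then fit one model per group."""
    groups = cluster_features(X, n_groups)
    models = [fit_ridge(X[:, g], y, lam) for g in groups]
    return groups, models

def predict_ensemble(groups, models, X):
    """Ensemble by averaging the per-group predictions (assumed rule)."""
    preds = [predict_ridge(m, X[:, g]) for g, m in zip(groups, models)]
    return np.mean(preds, axis=0)

# Simulated data: two blocks of features, each a noisy proxy of the signal s.
rng = np.random.default_rng(0)
n = 1500
s = rng.normal(size=n)
a, b = rng.normal(size=n), rng.normal(size=n)
block_a = (s + a)[:, None] + 0.1 * rng.normal(size=(n, 3))
block_b = (s + b)[:, None] + 0.1 * rng.normal(size=(n, 3))
X = np.hstack([block_a, block_b])
y = s + 0.1 * rng.normal(size=n)
X_tr, y_tr, X_te, y_te = X[:500], y[:500], X[500:], y[500:]

groups, models = fit_phalanx_ensemble(X_tr, y_tr, n_groups=2)
mse_ensemble = np.mean((y_te - predict_ensemble(groups, models, X_te)) ** 2)
mse_single = [np.mean((y_te - predict_ridge(m, X_te[:, g])) ** 2)
              for g, m in zip(groups, models)]
print("groups:", groups)
print("ensemble MSE:", mse_ensemble, "single-group MSEs:", mse_single)
```

In this toy setup each group carries an independent noisy view of the same signal, so averaging the two group models should lower the test error relative to either model alone. The number of groups is fixed by hand here, whereas the abstract describes choosing the phalanxes automatically.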

[1] W. Welch, et al. Ensembling classification models based on phalanxes of variables with applications in drug discovery, 2013, 1303.4805.

[2] Peter E. Thornton, et al. Dimensionality Reduction for Complex Models via Bayesian Compressive Sensing, 2014.

[3] Douglas M. Hawkins, et al. ChemModLab: A Web-Based Cheminformatics Modeling Laboratory, 2012, Silico Biol.

[4] Trevor Hastie, et al. Regularization Paths for Generalized Linear Models via Coordinate Descent, 2010, Journal of statistical software.

[5] V. Sheffield, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease, 2006, Proceedings of the National Academy of Sciences.

[6] Jun Feng, et al. PowerMV: A Software Environment for Molecular Viewing, Descriptor Generation, Data Analysis and Hit Evaluation, 2005, J. Chem. Inf. Model.

[7] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[8] Terence P. Speed, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, 2003, Bioinform.

[9] Pascal Lemberge, et al. Quantitative analysis of 16–17th century archaeological glass vessels using PLS regression of EPXMA and µ‐XRF data, 2000.

[10] Tony Davies, et al. Multivariate Analysis in Practice, a Training Package, 1996.

[11] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[12] F. Burden. Molecular identification number for substructure searches, 1989, J. Chem. Inf. Comput. Sci.

[13] R. Venkataraghavan, et al. Atom pairs as molecular features in structure-activity studies: definition and applications, 1985, J. Chem. Inf. Comput. Sci.