partDSA: deletion/substitution/addition algorithm for partitioning the covariate space in prediction

MOTIVATION: Until now, much of the focus in cancer research has been on biomarker discovery and on generating lists of univariately significant genes, along with epidemiological and clinical measures. These approaches, although valuable on their own, are not effective for elucidating the synergistic qualities of the numerous components of complex diseases. These components do not act one at a time, but in concert with numerous others. A compelling need exists for analytically sound and computationally advanced methods that take these interactions into account and so give a more biologically meaningful understanding of the mechanisms of cancer initiation and progression.

RESULTS: We propose a novel algorithm, partDSA, for prediction when several variables jointly affect the outcome. In such settings, piecewise constant estimation provides an intuitive approach by elucidating interactions and correlation patterns in addition to main effects. As well as generating 'and' statements similar to those of previously described methods, partDSA explores and chooses the best among all possible 'or' statements. The immediate benefit of partDSA is the ability to build a parsimonious model with 'and' and 'or' conjunctions that account for the observed biological phenomena. Importantly, partDSA is capable of handling categorical and continuous explanatory variables and outcomes. We evaluate the effectiveness of partDSA in comparison to several adaptive algorithms in simulations; additionally, we perform several data analyses with publicly available data and introduce the implementation of partDSA as an R package.

AVAILABILITY: http://cran.r-project.org/web/packages/partDSA/index.html

CONTACT: annette.molinaro@yale.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
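The piecewise constant idea described in the abstract can be illustrated with a minimal Python sketch. It is not the partDSA implementation and does not call the R package: the partitions are fixed by hand rather than found by a deletion/substitution/addition search, and all names (`in_region`, `in_partition`, `fit_piecewise_constant`, `predict`) and the toy data are hypothetical. Each region is an 'and' of interval conditions, each partition is an 'or' of regions, and the prediction within a partition is assumed to be the mean outcome of the observations that fall in it.

```python
from statistics import mean

INF = float("inf")

# A region is an 'and' of interval conditions: {variable: (low, high)},
# read as low < x[variable] <= high for every listed variable.
# A partition is an 'or' (union) of such regions.


def in_region(x, region):
    """True if observation x satisfies every condition of the region ('and')."""
    return all(low < x[var] <= high for var, (low, high) in region.items())


def in_partition(x, partition):
    """True if observation x falls in at least one region of the partition ('or')."""
    return any(in_region(x, region) for region in partition)


def fit_piecewise_constant(X, y, partitions):
    """Return one constant per partition: the mean outcome of its observations."""
    fitted = []
    for partition in partitions:
        values = [yi for xi, yi in zip(X, y) if in_partition(xi, partition)]
        fitted.append(mean(values) if values else None)
    return fitted


def predict(x, partitions, fitted):
    """Return the constant of the first partition that contains x."""
    for partition, value in zip(partitions, fitted):
        if in_partition(x, partition):
            return value
    return None


# Toy data (made up for illustration): the outcome depends on a and b jointly.
X = [
    {"a": 0.2, "b": 0.3},
    {"a": 0.8, "b": 0.9},
    {"a": 0.1, "b": 0.7},
    {"a": 0.9, "b": 0.2},
]
y = [1.0, 3.0, 10.0, 12.0]

# Partition 1: (a <= 0.5 and b <= 0.5) or (a > 0.5 and b > 0.5)
# Partition 2: (a <= 0.5 and b > 0.5) or (a > 0.5 and b <= 0.5)
partitions = [
    [{"a": (-INF, 0.5), "b": (-INF, 0.5)}, {"a": (0.5, INF), "b": (0.5, INF)}],
    [{"a": (-INF, 0.5), "b": (0.5, INF)}, {"a": (0.5, INF), "b": (-INF, 0.5)}],
]

fitted = fit_piecewise_constant(X, y, partitions)
print(fitted)
print(predict({"a": 0.6, "b": 0.4}, partitions, fitted))
```

In this toy setting neither variable is informative on its own, yet two partitions built from 'and' and 'or' statements describe the outcome, which is the kind of joint effect the abstract refers to.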
