Identifying representative trees from ensembles

Tree-based methods have become popular for analyzing complex data structures where the primary goal is risk stratification of patients. Ensemble techniques improve predictive accuracy and address the instability of a single tree by growing an ensemble of trees and aggregating their predictions. However, the individual trees are lost in the process. In this paper, we propose a methodology for identifying the most representative trees in an ensemble on the basis of several tree distance metrics. Although our focus is on binary outcomes, the methods are applicable to censored data as well. For any two trees, the distance metrics are chosen to (1) measure similarity of the covariates used to split the trees; (2) reflect similar clustering of patients in the terminal nodes of the trees; and (3) measure similarity in predictions from the two trees. Whereas the third metric focuses on prediction, the first two focus on the architectural similarity between two trees. The most representative trees in the ensemble are chosen on the basis of the average distance between a tree and all other trees in the ensemble. An out-of-bag estimate of the error rate is obtained using neighborhoods of representative trees. Simulations and data examples show gains in predictive accuracy when averaging over such neighborhoods. We illustrate our methods using a dataset of kidney cancer treatment receipt (binary outcome) and a second dataset of breast cancer survival (censored outcome).
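The selection rule described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the three distance functions below are assumptions chosen only to match the verbal descriptions of metrics (1) to (3), and every function name is hypothetical. Each tree is represented by summaries (the set of covariates it splits on, a terminal-node label per patient, and a prediction per patient) rather than by a fitted model.

```python
from itertools import combinations


def covariate_distance(used_a, used_b, n_covariates):
    """Assumed form of metric (1): fraction of covariates used by exactly one tree."""
    return len(set(used_a) ^ set(used_b)) / n_covariates


def clustering_distance(leaves_a, leaves_b):
    """Assumed form of metric (2): fraction of patient pairs that share a
    terminal node in one tree but not in the other."""
    pairs = list(combinations(range(len(leaves_a)), 2))
    disagreements = sum(
        (leaves_a[i] == leaves_a[j]) != (leaves_b[i] == leaves_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


def prediction_distance(pred_a, pred_b):
    """Assumed form of metric (3): mean squared difference in predictions."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)


def most_representative(trees, distance):
    """Return the index of the tree with the smallest average distance to
    all other trees, together with every tree's average distance."""
    n = len(trees)
    scores = []
    for i in range(n):
        total = sum(distance(trees[i], trees[j]) for j in range(n) if j != i)
        scores.append(total / (n - 1))
    best = min(range(n), key=scores.__getitem__)
    return best, scores


# Toy ensemble: predicted probabilities from three trees for four patients.
predictions = [
    [0.2, 0.8, 0.5, 0.1],
    [0.3, 0.7, 0.5, 0.1],
    [0.9, 0.1, 0.5, 0.6],
]
best_index, average_distances = most_representative(predictions, prediction_distance)
print(best_index, average_distances)
```

In this toy ensemble the second tree (index 1) has the smallest average prediction distance to the others, so it would be chosen as most representative under metric (3). The same `most_representative` helper accepts either of the other two distances, provided the trees are passed as the matching summaries.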
