Tree-Based Methods for Discovery of Association between Flow Cytometry Data and Clinical Endpoints

We demonstrate the application and comparative interpretation of three tree-based algorithms for the analysis of flow cytometry data: classification and regression trees (CART), random forests (RFs), and logic regression (LR). Specifically, we consider the question of what best predicts CD4 T-cell recovery in HIV-1-infected persons starting antiretroviral therapy with a CD4 count between 200 and 350 cells/μL. A comparison to a more standard contingency table analysis is provided. While contingency table analysis and RFs provide information on the importance of each potential predictor variable, CART and LR offer additional insight into the combinations of variables that together are predictive of the outcome. In all cases considered, baseline CD3-DR-CD56+CD16+ emerges as an important predictor variable, while the tree-based approaches identify additional variables as potentially informative. Application of tree-based methods to our data suggests that a combination of baseline immune activation states, with emphasis on CD8 T-cell activation, may be a better predictor than any single T-cell/innate cell subset analyzed. Taken together, these results show that tree-based methods can be successfully applied to flow cytometry data to discover associations that may not emerge in a univariate analysis.
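The contrast drawn above between per-variable importance and combinations of variables can be illustrated with a minimal sketch. It is not the study's analysis or software: the three binary markers, the toy outcome rule, and the helper names `gini`, `split_gain`, and `best_split` are all hypothetical, and the splitting criterion is the Gini impurity decrease commonly used in CART-style trees.

```python
from itertools import product


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_gain(rows, labels, j):
    """Impurity decrease obtained by splitting on binary feature j."""
    left = [y for x, y in zip(rows, labels) if x[j] == 0]
    right = [y for x, y in zip(rows, labels) if x[j] == 1]
    n = len(labels)
    child = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - child


def best_split(rows, labels, features):
    """Feature with the largest impurity decrease (first one wins ties)."""
    return max(features, key=lambda j: split_gain(rows, labels, j))


# Hypothetical toy data: three binary markers (0 = low, 1 = high),
# every combination repeated 10 times. The outcome is 1 only when
# markers 0 and 1 are BOTH high; marker 2 is pure noise.
rows = [combo for combo in product([0, 1], repeat=3) for _ in range(10)]
labels = [int(x[0] == 1 and x[1] == 1) for x in rows]

# One-variable-at-a-time view: impurity decrease of each marker alone.
gains = {j: split_gain(rows, labels, j) for j in range(3)}

# Tree view: split on the best marker, then split again inside the
# branch where that marker is high.
root = best_split(rows, labels, [0, 1, 2])
branch = [(x, y) for x, y in zip(rows, labels) if x[root] == 1]
b_rows = [x for x, _ in branch]
b_labels = [y for _, y in branch]
second = best_split(b_rows, b_labels, [j for j in range(3) if j != root])
second_gain = split_gain(b_rows, b_labels, second)

print("single-marker gains:", gains)
print("root split:", root, "second split:", second, "gain:", second_gain)
```

In this toy setting each of the two informative markers shows only a modest gain on its own, while the second split inside the first marker's "high" branch separates the outcome completely. This is the kind of combined structure that a tree can expose and a marker-by-marker screen does not.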
