Variable Selection through Correlation Sifting

Many applications in computational biology require a variable selection procedure that sifts through a large number of input variables and selects a smaller number that influence a target variable of interest. For example, in virology, only a small number of viral protein fragments influence the nature of the immune response during viral infection. Because of the large number of variables to be considered, a brute-force search for the relevant subset is in general intractable. As a tractable approximation, methods based on l1-regularized linear regression have been proposed and have been found to be particularly successful. It is well understood, however, that such methods fail to choose the correct subset of variables when those variables are highly correlated with other "decoy" variables. We present a method for sifting through sets of highly correlated variables that leads to higher accuracy in selecting the correct variables. The main innovation is a filtering step that reduces correlations among the variables to be selected, making l1-regularization effective on datasets for which many variable selection methods fail. The filtering step changes both the values of the predictor variables and the output values by projecting them onto components obtained through a computationally inexpensive principal components analysis. In this paper we demonstrate the usefulness of our method on synthetic datasets and on novel applications in virology. These include HIV viral load analysis based on patients' HIV sequences and immune types, as well as the analysis of seasonal variation in influenza death rates based on the regions of the influenza genome that undergo diversifying selection in the previous season.
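The filtering idea can be illustrated with a minimal sketch. The abstract does not say which principal components the projection keeps, so the code below assumes one plausible reading: project both the predictors and the output onto the orthogonal complement of the top-k principal component directions, then fit an l1-regularized regression on the filtered data. Because the same linear projection is applied to both sides, a linear model y = Xw + noise keeps the same coefficients w after filtering. The function names (`pca_filter`, `lasso_ista`, `mean_abs_offdiag_corr`), the choice k = 1, the penalty value, and the synthetic data are all assumptions made for illustration, and the lasso solver is a generic proximal-gradient routine, not the authors' implementation.

```python
import numpy as np

def pca_filter(X, y, k):
    """Assumed filter: remove the top-k principal component directions
    (left singular vectors of centered X) from both X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Uk = U[:, :k]                       # n x k, orthonormal columns
    Xf = Xc - Uk @ (Uk.T @ Xc)          # (I - Uk Uk^T) Xc
    yf = yc - Uk @ (Uk.T @ yc)          # same projection applied to y
    return Xf, yf, Uk

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by proximal gradient."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        z = w - X.T @ (X @ w - y) / (n * L)
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return w

def mean_abs_offdiag_corr(X):
    """Average absolute pairwise correlation between columns of X."""
    C = np.corrcoef(X, rowvar=False)
    p = C.shape[0]
    return (np.abs(C).sum() - p) / (p * (p - 1))

# Hypothetical synthetic data: every column shares one latent factor z,
# so true variables and "decoys" are all highly correlated.
rng = np.random.default_rng(0)
n, p = 200, 30
z = rng.normal(size=(n, 1))
X = z + 0.3 * rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[[2, 11, 25]] = [1.5, -2.0, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=n)

Xf, yf, Uk = pca_filter(X, y, k=1)
corr_before = mean_abs_offdiag_corr(X)
corr_after = mean_abs_offdiag_corr(Xf)
w_hat = lasso_ista(Xf, yf, lam=0.05)    # lam chosen arbitrarily
selected = np.flatnonzero(w_hat)
print("mean |corr| before/after filtering:", corr_before, corr_after)
print("selected variable indices:", selected)
```

In this toy setup the shared factor dominates the first principal component, so removing it lowers the pairwise correlations that the lasso then has to cope with. In practice the number of removed components and the penalty strength would need to be tuned, for example by cross-validation.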
