Selection of groups of correlated variables in high dimension

This thesis addresses variable selection in high dimension using regularized regression procedures in the presence of redundancy among the explanatory variables. Among the candidate variables, only a small number are assumed to be truly relevant for explaining the response. In this high-dimensional setting, classical Lasso-type approaches see their performance degrade as redundancy grows, since they do not account for it. Grouping the redundant variables beforehand can remedy this shortcoming, but usually requires calibrating additional parameters. The proposed approach combines grouping and variable selection, aiming at both interpretability and improved performance. First, a hierarchical agglomerative clustering (HAC) provides, at each level of its hierarchy, a partition of the variables into groups. The Group-lasso is then applied, at a fixed regularization parameter, to the collection of groups gathered from all levels of the HAC. Choosing this parameter then yields a list of candidate groups, potentially drawn from different levels. The final choice of groups is made through a multiple testing procedure. The proposed procedure exploits the hierarchical structure of the HAC and weights in the Group-lasso penalty, which considerably reduces the algorithmic complexity induced by this flexibility.
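The abstract describes the pipeline but gives no implementation. As a minimal sketch of the two main ingredients, the Python code below uses scipy's hierarchical clustering to collect the variable groups appearing at every level of the dendrogram, and solves a weighted Group-lasso by proximal gradient on the partition given by a single level. Everything here is an assumption made for illustration: the function names (hac_groups, block_soft_threshold, weighted_group_lasso), the correlation-based dissimilarity, the sqrt(group-size) weights, and the restriction to a single-level partition. The thesis itself pools groups across all HAC levels and follows with a multiple-testing step, neither of which is reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial.distance import pdist


def hac_groups(X):
    """All distinct variable groups appearing at any level of the HAC.

    Variables (columns of X) are clustered with a correlation-based
    dissimilarity; this metric choice is illustrative, not from the thesis.
    """
    Z = linkage(pdist(X.T, metric='correlation'), method='average')
    p = X.shape[1]
    groups = set()
    for k in range(1, p + 1):          # one cut per level of the hierarchy
        labels = cut_tree(Z, n_clusters=k).ravel()
        for lab in np.unique(labels):
            groups.add(tuple(np.flatnonzero(labels == lab)))
    return [np.array(g) for g in groups]


def block_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2: shrinks a block, zeroing small ones."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v


def weighted_group_lasso(X, y, partition, lam, n_iter=1000):
    """Proximal gradient for a Group-lasso with sqrt(group-size) weights.

    Assumes `partition` is a list of disjoint index arrays covering all
    columns (i.e., the groups of a single HAC level), unlike the thesis,
    which handles the overlapping groups pooled across levels.
    """
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n   # gradient step
        for g in partition:
            w = np.sqrt(len(g))                        # classical weighting
            beta[g] = block_soft_threshold(z[g], step * lam * w)
    return beta


# Usage sketch on synthetic data (shapes and lam are arbitrary):
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))
y = X[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(100)
Z = linkage(pdist(X.T, metric='correlation'), method='average')
labels = cut_tree(Z, n_clusters=10).ravel()
partition = [np.flatnonzero(labels == lab) for lab in np.unique(labels)]
beta = weighted_group_lasso(X, y, partition, lam=0.1)
```

Selecting the final groups would then require, as the abstract states, running the Group-lasso at a fixed regularization parameter over the groups pooled from all levels and applying a multiple testing procedure to the resulting candidates; the abstract does not specify those steps, so they are not sketched here.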
