Two-step sparse boosting for high-dimensional longitudinal data with varying coefficients

Varying-coefficient models are widely used to capture nonparametric interactions and have recently been adopted for analyzing longitudinal data measured repeatedly over time. In this article we focus on high-dimensional longitudinal observations and propose a novel two-step sparse boosting approach for variable selection and model-based prediction. As a machine learning tool, boosting seamlessly integrates model estimation and variable selection for complicated regression functions. Specifically, in the first step sparse boosting under a working-independence assumption is applied to obtain an initial estimate of the within-subject correlation structure; in the second step, the estimated correlation structure is incorporated into the loss function of the sparse boosting algorithm. Extensive numerical examples illustrate the advantages of the two-step sparse boosting method, and an application to yeast cell cycle gene expression data further demonstrates the proposed methodology.
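To make the two-step idea concrete, below is a minimal sketch in Python, not the authors' implementation. It assumes a balanced design with `n_sub` subjects each observed at `m` time points (rows of `X` and `y` grouped by subject), uses componentwise L2-boosting with constant rather than varying coefficients, and substitutes a fixed number of shrunken steps for the paper's sparsity-based stopping criterion. All function names and parameters are illustrative.

```python
import numpy as np

def componentwise_l2_boost(X, y, W, n_steps=100, nu=0.1):
    """Componentwise L2-boosting minimizing the weighted loss (y - f)' W (y - f).
    W = I recovers the working-independence fit of step one."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        r = y - X @ beta                       # current residual vector
        num = X.T @ (W @ r)                    # x_j' W r for every covariate j
        den = np.einsum('ij,ij->j', X, W @ X)  # x_j' W x_j for every covariate j
        gain = num ** 2 / den                  # loss reduction of each base learner
        j = int(np.argmax(gain))               # best-fitting covariate this step
        beta[j] += nu * num[j] / den[j]        # shrunken coordinate update
    return beta

def two_step_sparse_boost(X, y, n_sub, m, n_steps=100, nu=0.1):
    """Step 1: boost under working independence.
    Step 2: re-boost with weights from the covariance of step-1 residuals."""
    beta1 = componentwise_l2_boost(X, y, np.eye(n_sub * m), n_steps, nu)
    resid = (y - X @ beta1).reshape(n_sub, m)          # one residual row per subject
    Sigma = np.cov(resid, rowvar=False) + 1e-6 * np.eye(m)  # small ridge for stability
    W = np.kron(np.eye(n_sub), np.linalg.inv(Sigma))   # block-diagonal inverse covariance
    return componentwise_l2_boost(X, y, W, n_steps, nu)
```

In the paper the stopping iteration is itself chosen by a sparsity criterion (in the spirit of minimum description length), which is what keeps the selected coefficient vector sparse; the fixed `n_steps` with step length `nu` above is only a stand-in for that rule.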
