Fused Lasso Approach in Regression Coefficients Clustering - Learning Parameter Heterogeneity in Data Integration

As data sets from related studies become more accessible, they are often combined in practice to achieve a larger sample size and higher power. A major challenge in such data integration is heterogeneity across studies in population, design, or coordination. Ignoring this heterogeneity may result in biased estimation and misleading inference. Traditional remedies for data heterogeneity include interaction terms and random effects, which often fall short in delivering desirable statistical power or a meaningful interpretation, especially when a large number of smaller data sets are combined. In this paper, we propose a regularized fusion method that identifies and merges inter-study homogeneous parameter clusters in regression analysis without resorting to hypothesis testing. Using the fused lasso, we establish a computationally efficient procedure for large-scale integrated data. Incorporating the estimated parameter ordering into the fused lasso improves computing speed with no loss of statistical power. We conduct extensive simulation studies and provide an application example to demonstrate the performance of the new method in comparison with conventional methods.
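
To make the idea of fusing along an estimated parameter ordering concrete, the following is a minimal sketch and not the implementation from the paper. It assumes a single regression coefficient per study, replaces each study's loss with a quadratic term around a given initial estimate, and solves the resulting ordered fused lasso by plain coordinate descent. The function name `fuse_ordered`, the tuning value `lam`, and the six initial estimates are all hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold(x, t):
    """Soft-thresholding operator used in lasso coordinate descent."""
    return np.sign(x) * max(abs(x) - t, 0.0)


def fuse_ordered(initial, lam, n_sweeps=2000):
    """Fuse scalar study-level estimates along their estimated ordering.

    Minimizes 0.5 * sum_k (b_k - z_k)^2 + lam * sum_k |b_(k+1) - b_(k)|,
    where z holds the initial estimates sorted in ascending order.
    This quadratic loss is a simplifying assumption for illustration.
    """
    initial = np.asarray(initial, dtype=float)
    order = np.argsort(initial)  # estimated parameter ordering
    z = initial[order]
    K = len(z)

    # Reparameterize b = L @ theta: theta[0] is the lowest level and
    # theta[1:] are adjacent differences, so the fusion penalty becomes
    # an ordinary lasso penalty on theta[1:].
    L = np.tril(np.ones((K, K)))
    col_norms = (L ** 2).sum(axis=0)

    theta = np.zeros(K)
    theta[0] = z.mean()
    for _ in range(n_sweeps):
        for j in range(K):
            partial_resid = z - L @ theta + L[:, j] * theta[j]
            rho = L[:, j] @ partial_resid
            if j == 0:
                theta[j] = rho / col_norms[j]  # lowest level is unpenalized
            else:
                theta[j] = soft_threshold(rho, lam) / col_norms[j]

    fused_sorted = np.cumsum(theta)
    # A new cluster starts wherever an adjacent difference is nonzero.
    labels_sorted = np.concatenate(([0], np.cumsum(theta[1:] != 0)))

    # Map results back to the original study order.
    fused = np.empty(K)
    labels = np.empty(K, dtype=int)
    fused[order] = fused_sorted
    labels[order] = labels_sorted
    return fused, labels


# Hypothetical initial estimates from six studies (two underlying clusters).
initial_estimates = [1.02, 2.01, 0.98, 1.99, 1.00, 2.03]
fused, labels = fuse_ordered(initial_estimates, lam=0.1)
print(labels)
print(np.round(fused, 3))
```

Because the penalty only links neighbours in the sorted order, it has K - 1 terms rather than one term for every pair of studies, which is one plausible reading of why using the ordering helps computation. In this toy input the studies near 1 and the studies near 2 should each collapse to a common value, with the two fused values shrunk slightly toward each other.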
