Communication-Efficient Integrative Regression in High-Dimensions

We consider the task of meta-analysis in high-dimensional settings in which the data sources we wish to integrate are similar but non-identical. To borrow strength across such heterogeneous data sources, we introduce a global parameter that addresses several identification issues. We also propose a one-shot estimator of the global parameter that preserves the anonymity of the data sources and converges at a rate that depends on the size of the combined dataset. Finally, we demonstrate the benefits of our approach on a large-scale drug treatment dataset involving several different cancer cell lines.
