Linear expectile regression under massive data

Abstract In this paper, we study large-scale inference for a linear expectile regression model. To mitigate the computational challenges of classical asymmetric least squares (ALS) estimation under massive data, we propose a communication-efficient divide-and-conquer algorithm that combines information from sub-machines through confidence distributions. The resulting pooled estimator has a closed-form expression, and we establish its consistency and asymptotic normality under mild conditions. Moreover, we derive the Bahadur representation of the ALS estimator, which serves as an important tool for studying the relationship between the number of sub-machines K and the sample size. Numerical studies, including both synthetic and real data examples, illustrate the finite-sample performance of our method and support the theoretical results.
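The divide-and-conquer idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes (i) each block is fitted by ALS using iteratively reweighted least squares, (ii) each block's covariance is estimated with a sandwich formula, and (iii) the block estimates are pooled by inverse-covariance weighting, which is one common closed form that a confidence-distribution combination of asymptotically normal estimators can take. The function names `als_expectile` and `pooled_expectile`, the equal-sized contiguous split into K blocks, and the parameter defaults are all hypothetical choices made for illustration.

```python
import numpy as np


def als_expectile(X, y, tau=0.5, max_iter=100, tol=1e-10):
    """Fit a linear tau-expectile regression by asymmetric least squares.

    Uses iteratively reweighted least squares (an assumed solver) and
    returns the coefficient vector with a sandwich covariance estimate.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS start (exact for tau = 0.5)
    for _ in range(max_iter):
        # Weight tau for non-negative residuals, 1 - tau for negative ones.
        w = np.where(y - X @ beta >= 0, tau, 1.0 - tau)
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    resid = y - X @ beta
    w = np.where(resid >= 0, tau, 1.0 - tau)
    H = (X * w[:, None]).T @ X                   # sum of w_i x_i x_i'
    G = (X * ((w * resid) ** 2)[:, None]).T @ X  # sum of (w_i e_i)^2 x_i x_i'
    H_inv = np.linalg.inv(H)
    return beta, H_inv @ G @ H_inv


def pooled_expectile(X, y, tau=0.5, K=10):
    """Pool K block-wise ALS estimates by inverse-covariance weighting (assumed rule)."""
    p = X.shape[1]
    precision_sum = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    for idx in np.array_split(np.arange(len(y)), K):
        beta_k, cov_k = als_expectile(X[idx], y[idx], tau)
        prec_k = np.linalg.inv(cov_k)
        precision_sum += prec_k
        weighted_sum += prec_k @ beta_k
    cov = np.linalg.inv(precision_sum)
    return cov @ weighted_sum, cov
```

In this sketch each block only needs to communicate its coefficient vector and covariance matrix, so the pooled estimate is obtained in a single round without revisiting the raw data. The conditions under which such a pooled estimator matches the full-sample ALS estimator, including how fast K may grow with the sample size, are the subject of the paper's theory and are not captured here.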
