Computing confidence intervals from massive data via penalized quantile smoothing splines

Abstract New methodology is presented for the computation of pointwise confidence intervals from massive response data sets in one or two covariates using robust and flexible quantile regression splines. Novel aspects of the method include a new cross-validation procedure for selecting the penalization coefficient and a reformulation of the quantile smoothing problem based on a weighted data representation. These innovations permit uncertainty quantification and fast parameter selection in very large data sets via a distributed “bag of little bootstraps”. Experiments with synthetic data demonstrate that the computed confidence intervals feature empirical coverage rates that are generally within 2% of the nominal rates. Comparative (or “A/B”) experiments conducted at Netflix aimed at optimizing the quality of streaming video originally motivated this work, but the approach is broadly applicable to the analysis of large data sets in one or two dimensions. The methodology is illustrated using an open-source application: the comparison of geo-spatial climate model scenarios from NASA’s Earth Exchange.
