An Online Updating Approach for Testing the Proportional Hazards Assumption with Streams of Big Survival Data.

The Cox model, which remains the first choice for analyzing time-to-event data even for large datasets, relies on the proportional hazards (PH) assumption. When survival data arrive sequentially in chunks, a fast and minimally storage-intensive approach to testing the PH assumption is desirable. We propose an online updating approach that updates the standard test statistic as each new block of data becomes available and greatly lightens the computational burden. Under the null hypothesis of PH, the proposed statistic is shown to have the same asymptotic distribution as the standard version computed on the entire data stream with the data blocks pooled into one dataset. In simulation studies, the test and its variant based on the most recent data blocks maintain their sizes when the PH assumption holds and have substantial power to detect different violations of the PH assumption. We also show in simulation that our approach can be used successfully with "big data" that exceed a single computer's computational resources. The approach is illustrated with a survival analysis of patients with lymphoma from the Surveillance, Epidemiology, and End Results Program. The proposed test promptly identified a deviation from the PH assumption that was not captured by the test based on the entire data.
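The storage argument behind online updating can be illustrated with a minimal sketch. It is not the statistic proposed in the paper: it assumes, purely for illustration, a single covariate and that each incoming block has already been reduced to a scalar score-type contribution and a positive information-type contribution. The class name `OnlineQuadraticTest` and the block summaries are hypothetical. Only two running sums are kept, so no past block needs to be stored or revisited.

```python
from dataclasses import dataclass

# Upper 5% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_CRIT_05 = 3.841


@dataclass
class OnlineQuadraticTest:
    """Hypothetical accumulator: keeps running sums only, never the past blocks."""

    score_sum: float = 0.0   # cumulative score-type contribution (assumed input)
    info_sum: float = 0.0    # cumulative information-type contribution (assumed input)
    blocks_seen: int = 0

    def update(self, block_score: float, block_info: float) -> float:
        """Fold in one block's summaries and return the updated statistic."""
        if block_info <= 0:
            raise ValueError("block_info must be positive")
        self.score_sum += block_score
        self.info_sum += block_info
        self.blocks_seen += 1
        return self.statistic()

    def statistic(self) -> float:
        """Quadratic-form statistic U^2 / V built from the cumulative sums."""
        if self.info_sum == 0.0:
            return 0.0
        return self.score_sum ** 2 / self.info_sum

    def rejects(self) -> bool:
        """Compare against the 1-df chi-square critical value at the 5% level."""
        return self.statistic() > CHI2_1DF_CRIT_05


# Made-up block summaries (score, information), for illustration only.
blocks = [(1.2, 4.0), (-0.5, 3.5), (2.1, 5.0)]

test = OnlineQuadraticTest()
running = [test.update(u, v) for u, v in blocks]

# The same quantity computed in one pass over the pooled summaries.
pooled = sum(u for u, _ in blocks) ** 2 / sum(v for _, v in blocks)
```

In this toy setting the final running value coincides with the pooled value because the block summaries are additive. The abstract's claim is weaker and more realistic: the online statistic has the same asymptotic null distribution as the pooled one, not necessarily the same finite-sample value.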
