Accurate Computation of Survival Statistics in Genome-Wide Studies

A key challenge in genomics is to identify genetic variants that distinguish patients with different survival times following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations rely on an asymptotic approximation that is not appropriate in many genomics applications, for two reasons: the two populations determined by a genetic variant may have very different sizes, and the evaluation of many possible variants demands highly accurate computation of very small p-values. We demonstrate this problem on cancer genomics data, where the standard log-rank test leads to many false positive associations between somatic mutations and survival time. We develop and analyze a novel algorithm, Exact Log-rank Test (ExaLT), that accurately computes the p-value of the log-rank statistic under an exact distribution that is appropriate for populations of any size. We demonstrate the advantages of ExaLT on data from published cancer genomics studies, finding significant differences from the reported p-values. We analyze somatic mutations in six cancer types from The Cancer Genome Atlas (TCGA), finding mutations with known association to survival as well as several novel associations. In contrast, standard implementations of the log-rank test report dozens to hundreds of likely false positive associations as more significant than these known associations.
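The gap between the asymptotic and exact p-values can be illustrated with a small sketch. The code below is not ExaLT: it is a brute-force enumeration of all group assignments (a permutational null), which is feasible only for toy inputs, and it assumes a two-sided test based on the absolute value of the observed-minus-expected statistic. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import erfc, sqrt


def logrank_stat(times, events, group):
    """Return (U, V): observed minus expected events in group 1, and the
    hypergeometric variance, summed over distinct event times."""
    n = len(times)
    event_times = sorted({t for t, e in zip(times, events) if e})
    u = 0.0
    v = 0.0
    for t in event_times:
        at_risk = [i for i in range(n) if times[i] >= t]
        n_j = len(at_risk)                       # patients at risk at time t
        n1_j = sum(group[i] for i in at_risk)    # of those, in group 1
        d_j = sum(1 for i in at_risk if times[i] == t and events[i])
        d1_j = sum(1 for i in at_risk
                   if times[i] == t and events[i] and group[i])
        u += d1_j - d_j * n1_j / n_j
        if n_j > 1:
            frac = n1_j / n_j
            v += d_j * frac * (1 - frac) * (n_j - d_j) / (n_j - 1)
    return u, v


def asymptotic_pvalue(times, events, group):
    """Two-sided p-value from the normal approximation to U / sqrt(V)."""
    u, v = logrank_stat(times, events, group)
    if v == 0:
        return 1.0
    return erfc(abs(u / sqrt(v)) / sqrt(2))


def permutation_pvalue(times, events, group):
    """Two-sided p-value by enumerating every assignment of the same number
    of patients to group 1 (exponential cost; toy sizes only)."""
    n = len(times)
    n1 = sum(group)
    u_obs, _ = logrank_stat(times, events, group)
    hits = 0
    total = 0
    for members in combinations(range(n), n1):
        g = [0] * n
        for i in members:
            g[i] = 1
        u, _ = logrank_stat(times, events, g)
        total += 1
        if abs(u) >= abs(u_obs) - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / total


# Hypothetical toy data: 12 patients, only 2 of them carry the variant.
times = [2, 3, 5, 7, 8, 11, 13, 14, 17, 19, 23, 29]
events = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
group = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print("asymptotic :", asymptotic_pvalue(times, events, group))
print("permutation:", permutation_pvalue(times, events, group))
```

With only two patients in the smaller group there are just 66 possible assignments, so the enumerated p-value cannot fall below 1/66, while the normal approximation places no such floor on the value it reports. This is the unbalanced setting the abstract describes; an efficient algorithm is needed because enumeration grows exponentially with sample size.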
