Problems due to small samples and sparse data in conditional logistic regression analysis.

Conditional logistic regression was developed to avoid "sparse-data" biases that can arise in ordinary logistic regression analysis. Nonetheless, it is a large-sample method that can exhibit considerable bias when certain types of matched sets are infrequent or when the model contains too many parameters. Sparse-data bias can cause misleading inferences about confounding, effect modification, dose response, and induction periods, and can interact with other biases. In this paper, the authors describe these problems in the context of matched case-control analysis and provide examples from a study of electrical wiring and childhood leukemia and a study of diet and glioma. The same problems can arise in any likelihood-based analysis, including ordinary logistic regression. The problems can be detected by careful inspection of data and by examining the sensitivity of estimates to category boundaries, variables in the model, and transformations of those variables. One can also apply various bias corrections or turn to methods less sensitive to sparse data than conditional likelihood, such as Bayesian and empirical-Bayes (hierarchical regression) methods.
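
As a rough illustration of why conditional maximum likelihood remains a large-sample method, consider the simplest matched design: 1:1 matched pairs with a single binary exposure. This is a simplifying assumption for illustration and is not one of the paper's examples. In that design the conditional maximum-likelihood odds-ratio estimate is the ratio of the two types of exposure-discordant pairs, and given the number of discordant pairs, the count of case-exposed pairs is binomial with probability OR/(1 + OR). The sketch below computes the exact mean of the estimate over samples in which it is finite. The function name `mean_conditional_or` and the chosen odds ratio of 2 are hypothetical.

```python
from math import comb


def mean_conditional_or(true_or: float, n_discordant: int) -> float:
    """Exact mean of the matched-pair conditional ML odds-ratio estimate.

    Assumes 1:1 matching and one binary exposure (an illustrative
    simplification). Samples where the estimate is zero or infinite
    (all discordant pairs of one type) are excluded, so the mean is
    conditional on a finite, nonzero estimate. Requires n_discordant >= 2.
    """
    # Probability that a discordant pair has the case exposed.
    p = true_or / (1.0 + true_or)
    weighted_sum = 0.0
    kept_prob = 0.0
    for k in range(1, n_discordant):  # k = pairs with the case exposed
        prob = comb(n_discordant, k) * p**k * (1.0 - p) ** (n_discordant - k)
        kept_prob += prob
        weighted_sum += prob * k / (n_discordant - k)  # estimate = k / (m - k)
    return weighted_sum / kept_prob


# Mean estimate for a true odds ratio of 2 as discordant pairs become scarcer.
for m in (400, 100, 25, 10):
    print(m, round(mean_conditional_or(2.0, m), 3))
```

Under these assumptions the mean estimate moves further above the true odds ratio as the number of discordant pairs shrinks, which is the kind of away-from-the-null behavior the abstract attributes to infrequent matched-set types.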
