Ecological inference for 2 × 2 tables

Summary. A fundamental problem in many disciplines, including political science, sociology and epidemiology, is the examination of the association between two binary variables across a series of 2 × 2 tables, when only the margins are observed, and one of the margins is fixed. Two unobserved fractions are of interest, with only a single response per table, and it is this non‐identifiability that is the inherent difficulty lying at the heart of ecological inference. Many methods have been suggested for ecological inference, often without a probabilistic model; we clarify the form of the sampling distribution and critique previous approaches within a formal statistical framework, thus allowing clarification and examination of the assumptions that are required under all approaches. A particularly difficult problem is choosing between models with and without contextual effects. Various Bayesian hierarchical modelling approaches are proposed to allow the formal inclusion of supplementary data, and/or prior information, without which ecological inference is unreliable. Careful choice of the prior within such models is required, however, since there may be considerable sensitivity to this choice, even when the model assumed is correct and there are no contextual effects. This sensitivity is shown to be a function of the number of areas and the distribution of the proportions in the fixed margin across areas. By explicitly providing a likelihood for each table, the combination of individual level survey data and aggregate level data is straightforward and we illustrate that survey data can be highly informative, particularly if these data are from a survey of the minority population within each area. This strategy is related to designs that are used in survey sampling and in epidemiology. An approximation to the suggested likelihood is discussed, and various computational approaches are described. Some extensions are outlined including the consideration of multiway tables, spatial dependence and area‐specific (contextual) variables. Voter registration–race data from 64 counties in the US state of Louisiana are used to illustrate the methods.
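
To make the non-identifiability concrete, the following is a minimal sketch of one way to write a likelihood for a single table when only the margins are seen. It assumes, purely for illustration, that the two row counts are independent binomials, so that the observed column total is their sum; the function names, the row labels 0 and 1, and the numbers in the usage lines are hypothetical and are not taken from the summary above.

```python
from math import comb


def convolution_likelihood(y, n0, n1, p0, p1):
    """P(Y = y) for Y = Y0 + Y1, with Y0 ~ Bin(n0, p0) and Y1 ~ Bin(n1, p1).

    Assumption: independent binomial rows. Only y, n0 and n1 are observed;
    the internal split (y0, y1) is summed out.
    """
    lo = max(0, y - n1)   # smallest feasible y0 given the margins
    hi = min(n0, y)       # largest feasible y0 given the margins
    total = 0.0
    for y0 in range(lo, hi + 1):
        y1 = y - y0
        total += (comb(n0, y0) * p0**y0 * (1 - p0)**(n0 - y0)
                  * comb(n1, y1) * p1**y1 * (1 - p1)**(n1 - y1))
    return total


def fraction_bounds(y, n0, n1):
    """Deterministic bounds on the unobserved fractions y0/n0 and y1/n1."""
    b0 = (max(0, y - n1) / n0, min(n0, y) / n0)
    b1 = (max(0, y - n0) / n1, min(n1, y) / n1)
    return b0, b1


# Hypothetical table: row totals 40 and 60 (fixed margin), column total 60.
# Each (p0, p1) pair below gives the same marginal mean 0.4*p0 + 0.6*p1 = 0.6.
for p0, p1 in [(0.6, 0.6), (0.3, 0.8), (0.9, 0.4)]:
    print(p0, p1, convolution_likelihood(60, 40, 60, p0, p1))
print(fraction_bounds(60, 40, 60))
```

Under this sketch, very different pairs of row probabilities can all give the observed margin appreciable likelihood, which is one way to see why a single table cannot separate the two fractions without extra data or prior information.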
