Still Much to Learn About Confidence Intervals

Confidence intervals (CIs), rather than p values, should often provide the major justification for conclusions drawn from data. Therefore, CIs should be reported, and also interpreted. Rouder and Morey (2005) distinguished “arelational” CIs (e.g., CIs around single sample means) and “relational” CIs (e.g., CIs around mean differences or standardized effect sizes). They argued that the former are not suitable for inference and that researchers are justified in not interpreting such intervals. Yet, the purpose of research using samples is almost always to make inferences to populations. CIs (including arelational CIs) are, by design, inferential statistics and can legitimately serve to justify inferential conclusions.

Rouder and Morey argued that rather than using arelational CIs for inference, authors should exploit the fact that they “provide a rough guide to variability in data, a coarse view of the replicability of patterns, and a quick check of the heterogeneity of variance” (p. 77). We believe there are problems with these three suggestions. First, variability in data is represented directly by descriptive statistics, such as the standard deviation. A CI, by contrast, is often based on a standard error and influenced by sample size. Similar levels of variability will give CIs of very different widths, depending on group size, so CIs should not be relied on to give even a rough guide to variability in data. Second, CIs do give information about replicability, but we (Cumming, Williams, & Fidler, 2004) reported that a majority of researchers, seeing a CI, markedly underestimate the true extent of variability over replications. Further, Maxwell (2004, p. 157) pointed out that in many realistic research situations, the pattern of results shown by CIs is unstable over replication. Finally, for examining heterogeneity of variance, descriptive rather than inferential statistics (standard deviations rather than standard errors or CIs) are again needed.
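
The dependence of CI width on group size noted above can be illustrated with a minimal sketch. The standard deviation, the group sizes, and the function name `ci_half_width` are hypothetical, and a normal critical value is assumed in place of t, so intervals for small groups would in practice be somewhat wider than shown.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(sd, n, level=0.95):
    """Half-width of a CI on a single mean.

    Assumption: large-sample (normal) critical value; a t critical
    value would widen the interval further when n is small.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return z_crit * sd / sqrt(n)


# Hypothetical groups with identical variability (SD = 10) but different sizes.
sd = 10.0
for n in (10, 40, 160):
    width = ci_half_width(sd, n)
    print(f"n = {n:3d}   SD = {sd:.1f}   95% CI half-width = {width:.2f}")
```

Under these assumptions the standard deviation is the same in every row, yet each fourfold increase in group size halves the interval, so CI width alone says little about the variability in the data.
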
Only if group sizes are equal will CIs give an accurate guide. Rouder and Morey’s comments reinforce the need to report standard deviations, but do not justify noninterpretation of CIs.

CIs are rarely reported in journals outside medicine (Kieffer, Reese, & Thompson, 2001). Even in medicine, where they have been routinely reported for two decades, they are rarely interpreted (Fidler, Thomason, Cumming, Finch, & Leeman, 2004). Guidelines for and examples of good practice are lacking, and we support research to develop and evaluate better guidelines for use and interpretation of CIs. Thompson (2002) noted, “It is conceivable that some researchers may not fully understand statistical methods that they (a) rarely read in the literature and (b) infrequently use in their own work” (p. 26). For example, it is widely believed (Belia, Fidler, Williams, & Cumming, 2004; Schenker & Gentleman, 2001) that two 95% CIs having zero overlap (just touching end to end) are equivalent to statistical significance with p = .05. In fact, for 95% CIs on two independent means, overlap by about one quarter of the total length of one interval corresponds to a p value of about .05 (Cumming & Finch, 2005; Saville, 2003; Wolfe & Hanley, 2002).

Rouder and Morey argued that “arelational CIs . . . do not reflect between-groups information and cannot be used for direct comparisons” (p. 77). This is true for repeated measures designs, in which CIs on separate cell means do not provide the relevant information for a comparison, but it does not hold for independent groups. For two independent groups, the difference between the means has a p value of about .05 when the separate 95% CIs overlap by about 25% of the length of either interval, and a p value of about .01 when the two intervals just touch end to end (see Cumming & Finch, 2005, for a discussion of the breadth of applicability of these rules).

The terms arelational and relational might be useful in describing the type of CIs reported.
However, such distinctions should not be used to determine the use of CIs. Of course, as for statistical tests, thought should be given to what is the most appropriate CI for the situation (Wilkinson & the Task Force on Statistical

Address correspondence to Fiona Fidler, Department of History and Philosophy of Science, University of Melbourne, 3010, Victoria, Australia; e-mail: fidlerfm@unimelb.edu.au.

PSYCHOLOGICAL SCIENCE
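
The overlap rules for two independent groups discussed earlier can be checked numerically. The following is a rough sketch rather than the derivation used in the cited papers: it assumes both means have the same standard error and uses normal critical values in place of t, and the function name `p_value_from_overlap` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value_from_overlap(overlap_fraction, level=0.95):
    """Two-sided p value for the difference between two independent means,
    given how much their separate CIs overlap.

    overlap_fraction is the overlap as a fraction of one interval's full
    length: 0.0 means the intervals just touch end to end, 0.25 means
    they overlap by one quarter of an interval.

    Assumptions (not from the source): equal standard errors in both
    groups, and normal rather than t critical values.
    """
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    se = 1.0                      # common standard error (arbitrary unit)
    margin = z_crit * se          # half-width of each CI
    full_length = 2 * margin
    # Distance between the means: sum of the half-widths minus the overlap.
    mean_difference = 2 * margin - overlap_fraction * full_length
    se_difference = sqrt(2) * se  # SE of a difference of independent means
    z = mean_difference / se_difference
    return 2 * (1 - STD_NORMAL.cdf(z))


for fraction in (0.25, 0.0):
    p = p_value_from_overlap(fraction)
    print(f"overlap = {fraction:.2f} of interval length -> p = {p:.3f}")
```

Under these simplifying assumptions, one-quarter overlap gives a p value of roughly .04 and touching intervals give roughly .006. Both are somewhat below the approximate figures of .05 and .01, which is the direction expected when normal critical values replace t, and well away from the common belief that touching intervals correspond to p = .05.
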

[1] D. Saville et al. Basic statistics and the inconsistency of multiple comparison procedures. Canadian Journal of Experimental Psychology = Revue canadienne de psychologie experimentale, 2003.

[2] M. Masson et al. Using confidence intervals in within-subject designs. Psychonomic Bulletin & Review, 1994.

[3] Leland Wilkinson et al. Statistical Methods in Psychology Journals: Guidelines and Explanations. 2005.

[4] M. Masson. Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology = Revue canadienne de psychologie experimentale, 2003.

[5] James Hanley et al. If we're so different, why do we keep overlapping? When 1 plus 1 doesn't make 2. CMAJ: Canadian Medical Association Journal = Journal de l'Association medicale canadienne, 2002.

[6] G. Cumming et al. Editors Can Lead Researchers to Confidence Intervals, but Can't Make Them Think. Psychological Science, 2004.

[7] G. Cumming et al. Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 2005.

[8] S. Maxwell. The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods, 2004.

[9] Bruce Thompson et al. Statistical Techniques Employed in AERJ and JCP Articles from 1988 to 1997: A Methodological Review. 2001.

[10] N. Schenker et al. On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals. 2001.

[11] B. Thompson. What Future Quantitative Social Science Research Could Look Like: Confidence Intervals for Effect Sizes. 2002.

[12] F. Schmidt. Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers. 1996.

[13] William K. Estes et al. On the communication of information by displays of standard errors and confidence intervals. 1997.

[14] G. Cumming et al. Inference by eye: confidence intervals and how to read pictures of data. The American Psychologist, 2005.

[15] G. Cumming et al. Replication and Researchers' Understanding of Confidence Intervals and Standard Error Bars. 2004.

[16] M. Gardner et al. Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal, 1986.

[17] Jeffrey N. Rouder et al. Relational and Arelational Confidence Intervals. Psychological Science, 2005.