Statistical Conclusion Validity: Some Common Threats and Simple Remedies

The ultimate goal of research is to produce dependable knowledge or to provide evidence that can guide practical decisions. Statistical conclusion validity (SCV) holds when the conclusions of a research study are founded on an adequate analysis of the data. This generally means that the statistical methods used are logically capable of answering the research question and that their small-sample behavior is accurate. Relative to the three other traditional aspects of research validity (external validity, internal validity, and construct validity), interest in SCV has grown recently, prompted by evidence that inadequate data analyses are sometimes carried out and yield conclusions that a proper analysis of the data would not have supported. This paper discusses evidence of three common threats to SCV that arise from widespread recommendations or practices in data analysis: the use of repeated testing and optional stopping without control of Type-I error rates, the recommendation to check the assumptions of statistical tests, and the use of regression whenever a bivariate relation or the equivalence between two variables is studied. For each of these threats, examples are presented and alternative practices that safeguard SCV are discussed. Educational and editorial changes that may improve the SCV of published research are also discussed.
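The first threat named above, repeated testing with optional stopping, can be illustrated with a small simulation. The sketch below is not taken from the paper; it assumes normally distributed data with known unit variance, a two-sided z-test at the nominal 5% level, and a true null hypothesis. The function name `false_positive_rate` and its parameters (`n_looks`, `batch_size`, `n_sims`, `z_crit`, `seed`) are hypothetical and chosen only for illustration.

```python
import math
import random


def false_positive_rate(n_looks, batch_size=10, n_sims=2000, z_crit=1.96, seed=0):
    """Estimate the Type-I error rate of a two-sided z-test that is run after
    every batch of observations, stopping at the first significant result.

    Assumption: data are N(0, 1) with known variance, so the null is true
    and every rejection is a false positive.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total = 0.0  # running sum of all observations collected so far
        n = 0        # running sample size
        for _ in range(n_looks):
            for _ in range(batch_size):
                total += rng.gauss(0.0, 1.0)
            n += batch_size
            z = total / math.sqrt(n)  # z statistic under known unit variance
            if abs(z) > z_crit:
                rejections += 1
                break  # optional stopping: quit as soon as p < .05
    return rejections / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 10):
        rate = false_positive_rate(looks)
        print(f"looks={looks:2d}  estimated Type-I error rate={rate:.3f}")
```

With a single look the estimated rate should sit near the nominal level, whereas allowing more looks with the same fixed critical value should push it upward. Sequential stopping rules that adjust the criterion at each look are one way to keep the overall rate under control.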
