The Danger of Testing by Selecting Controlled Subsets, with Applications to Spoken-Word Recognition

When examining the effect of a continuous variable x on an outcome y, a researcher might choose to dichotomize on x, dividing the population into two sets (low x and high x) and testing whether these two subpopulations differ with respect to y. Dichotomization has long been known to incur a cost in statistical power, but there remain circumstances in which it is appealing: an experimenter might use it to control for confounding covariates through subset selection, by carefully choosing a subpopulation of Low and a corresponding subpopulation of High that are balanced with respect to a list of control variables, and then comparing the subpopulations' y values. This "divide, select, and test" approach is used in many papers throughout the psycholinguistics literature and elsewhere. Here we show that, despite their apparent innocuousness, these methodological choices can lead to erroneous results in two ways. First, if the balanced subsets of Low and High are selected in certain ways, it is possible to infer a relationship between x and y that is not present in the full population. Specifically, we show that previously published conclusions drawn from this methodology, about the effect of a particular lexical property on spoken-word recognition, do not in fact appear to hold. Second, if the balanced subsets of Low and High are selected randomly, this methodology frequently fails to show a relationship between x and y that is present in the full population. Our work uncovers a new facet of an ongoing research effort: to identify and reveal the implicit freedoms of experimental design that can lead to false conclusions.
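The second failure mode can be illustrated with a small simulation. The sketch below is not the authors' analysis: the population size, subset size, slope, number of trials, and the normal-approximation cutoff of 1.96 are all assumptions chosen for illustration, and no control variables are modeled, so a random subset of each half stands in for a "balanced" selection. It builds a population in which y truly depends on x, then compares how often a correlation test on the full population and a two-group test on small random Low and High subsets detect the relationship.

```python
import math
import random
import statistics

Z_CRIT = 1.96  # two-sided 5% cutoff, normal approximation (an assumption)


def welch_t(a, b):
    """Welch t statistic for the difference in means of two samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def correlation_t(xs, ys):
    """t statistic for the Pearson correlation between xs and ys."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def median_split(xs, ys):
    """Return the y values of the Low (x <= median) and High (x > median) groups."""
    cut = statistics.median(xs)
    low = [y for x, y in zip(xs, ys) if x <= cut]
    high = [y for x, y in zip(xs, ys) if x > cut]
    return low, high


def detection_rates(trials=200, n=400, subset=20, slope=0.2, seed=0):
    """Fraction of simulated populations in which each test exceeds Z_CRIT.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    full_hits = subset_hits = 0
    for _ in range(trials):
        # Population in which y genuinely depends on x.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [slope * x + rng.gauss(0, 1) for x in xs]
        # Test on the full population, keeping x continuous.
        if abs(correlation_t(xs, ys)) > Z_CRIT:
            full_hits += 1
        # Divide, select random subsets, and test.
        low, high = median_split(xs, ys)
        t = welch_t(rng.sample(low, subset), rng.sample(high, subset))
        if abs(t) > Z_CRIT:
            subset_hits += 1
    return full_hits / trials, subset_hits / trials


if __name__ == "__main__":
    full_rate, subset_rate = detection_rates()
    print(f"full population: {full_rate:.2f}, random subsets: {subset_rate:.2f}")
```

Under these assumed settings, the subset-based test should detect the relationship far less often than the full-population test, which is the pattern the abstract describes for randomly selected subsets. The first failure mode (a spurious effect arising from non-random selection) is not modeled here.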
