How bandwidth selection algorithms impact exploratory data analysis using kernel density estimation.

Exploratory data analysis (EDA) can reveal important features of underlying distributions, and these features often affect the inferences and conclusions drawn from data. Graphical analysis is central to EDA, and graphical representations of distributions often benefit from smoothing. A viable method of estimating and graphing the underlying density in EDA is kernel density estimation (KDE). This article provides an introduction to KDE and examines alternative methods for specifying the smoothing bandwidth in terms of their ability to recover the true density. We also illustrate the comparison and use of KDE methods with 2 empirical examples. Simulations were carried out in which we compared 8 bandwidth selection methods (Sheather-Jones plug-in [SJDP], normal rule of thumb, Silverman's rule of thumb, least squares cross-validation, biased cross-validation, and 3 adaptive kernel estimators) using 5 true density shapes (standard normal, positively skewed, bimodal, skewed bimodal, and standard lognormal) and 9 sample sizes (15, 25, 50, 75, 100, 250, 500, 1,000, 2,000). Results indicate that, overall, SJDP outperformed all other methods. However, for smaller sample sizes (25 to 100), either biased cross-validation or Silverman's rule of thumb is recommended, and for larger sample sizes the adaptive kernel estimator with SJDP is recommended. Information is provided about implementing the recommendations in the R computing language.
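To make the role of the bandwidth concrete, the following is a minimal sketch of a fixed-bandwidth Gaussian KDE with two of the rule-of-thumb selectors named above, in their common textbook forms: Silverman's rule, h = 0.9 min(s, IQR/1.34) n^(-1/5), and the normal rule, h = 1.06 s n^(-1/5). It is written in Python with the standard library only and is not the article's R code. The function names, the simulated standard normal sample of size 100, and the quartile convention (the default of Python's `statistics.quantiles`) are assumptions made for illustration, so the bandwidths may differ slightly from those of other software.

```python
import math
import random
import statistics


def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    sd = statistics.stdev(data)
    # Quartiles use the statistics module's default method (an assumption here).
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)


def normal_reference_bandwidth(data):
    """Normal rule of thumb: 1.06 * sd * n^(-1/5)."""
    return 1.06 * statistics.stdev(data) * len(data) ** (-1 / 5)


def gaussian_kde(data, h):
    """Return a function x -> Gaussian kernel density estimate with bandwidth h."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian kernels centered at each observation.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return density


# Hypothetical data: 100 draws from a standard normal distribution.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(100)]

h_silverman = silverman_bandwidth(sample)
h_normal = normal_reference_bandwidth(sample)
f_hat = gaussian_kde(sample, h_silverman)

# Riemann-sum check that the estimate integrates to roughly 1 over [-6, 6].
grid = [-6.0 + 0.01 * i for i in range(1201)]
area = sum(f_hat(x) for x in grid) * 0.01
print(h_silverman, h_normal, area)
```

Because Silverman's rule uses the smaller constant and the smaller of two spread estimates, it never gives a larger bandwidth than the normal rule on the same data, so its curve is the less smoothed of the two. The cross-validation, plug-in, and adaptive selectors compared in the simulations are not covered by this sketch.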
