Data-Based Choice of Histogram Bin Width

Abstract: The most important parameter of a histogram is the bin width, because it controls the tradeoff between presenting a picture with too much detail (“undersmoothing”) and too little detail (“oversmoothing”) with respect to the true distribution. Despite this importance, there has been surprisingly little research into estimation of the “optimal” bin width. Default bin widths in most common statistical packages are, at least for large samples, quite far from the optimal bin width. Rules proposed by, for example, Scott lead to better large-sample performance of the histogram, but are not themselves consistent. In this paper we extend the bin width rules of Scott to those that achieve root-n rates of convergence to the L2-optimal bin width, thereby providing firm scientific justification for their use. Moreover, the proposed rules are simple, easy and fast to compute, and perform well in simulations.
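The contrast drawn in the abstract between package defaults and Scott-type rules can be sketched in a few lines. The Python below is a minimal illustration only, and it is not the rule proposed in this paper: it compares the bin width implied by Sturges' class-count rule (k = 1 + log2(n) equal-width bins over the sample range), a common software default, with Scott's normal reference rule (h = 3.49 s n^(-1/3), where s is the sample standard deviation). The function names, the standard normal test data, and the fixed seed are assumptions chosen for the example.

```python
import math
import random
import statistics


def sturges_bin_width(data):
    """Bin width implied by Sturges' rule: k = ceil(1 + log2(n)) bins over the range."""
    n = len(data)
    k = math.ceil(1 + math.log2(n))
    return (max(data) - min(data)) / k


def scott_bin_width(data):
    """Scott's normal reference rule: h = 3.49 * s * n^(-1/3)."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation
    return 3.49 * s * n ** (-1 / 3)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, an assumption for reproducibility
    for n in (100, 1_000, 10_000):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        print(
            f"n={n:>6}  Sturges h={sturges_bin_width(sample):.3f}  "
            f"Scott h={scott_bin_width(sample):.3f}"
        )
```

Under these assumptions, the Sturges width shrinks only logarithmically in n while the Scott width shrinks like n^(-1/3), so the gap between the two widens as the sample grows. This is one way to see why a class-count default can oversmooth large samples.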

[1]  Herbert A. Sturges,et al.  The Choice of a Class Interval , 1926 .

[2]  M. Woodroofe On Choosing a Delta-Sequence , 1970 .

[3]  David P. Doane,et al.  Aesthetic Frequency Classifications , 1976 .

[4]  D. W. Scott On optimal and data based histograms , 1979 .

[5]  R. J. Serfling Approximation Theorems of Mathematical Statistics , 1980 .

[6]  D. Freedman,et al.  On the histogram as a density estimator: L2 theory , 1981 .

[7]  P. Hall Central limit theorem for integrated square error of multivariate nonparametric density estimators , 1984 .

[8]  James Stephen Marron,et al.  Comparison of data-driven bandwidth selectors , 1988 .

[9]  M. C. Jones,et al.  A reliable data-based bandwidth selection method for kernel density estimation , 1991 .

[10]  M. C. Jones,et al.  On optimal data-based bandwidth selection in kernel density estimation , 1991 .

[11]  Simon J. Sheather,et al.  Using non stochastic terms to advantage in kernel-based estimation of integrated squared density derivatives , 1991 .

[12]  James Stephen Marron,et al.  Lower bounds for bandwidth selection in density estimation , 1991 .

[13]  Brian Kent Aldershof,et al.  Estimation of integrated squared density derivatives , 1991 .

[14]  David W. Scott,et al.  Multivariate Density Estimation: Theory, Practice, and Visualization , 1992, Wiley Series in Probability and Statistics.

[15]  James Stephen Marron,et al.  Best Possible Constant for Bandwidth Selection , 1992 .

[16]  James Stephen Marron,et al.  On the use of pilot estimators in bandwidth selection , 1992 .

[17]  M. Wand,et al.  EXACT MEAN INTEGRATED SQUARED ERROR , 1992 .

[18]  Jianqing Fan,et al.  Fast implementations of nonparametric curve estimators , 1993 .

[19]  M. Wand Fast Computation of Multivariate Kernel Estimators , 1994 .

[20]  Joachim Engel,et al.  An iterative bandwidth selector for kernel estimation of densities and their derivatives , 1994 .

[21]  Scale measures for bandwidth selection , 1995 .

[22]  Matthew P. Wand,et al.  Kernel Smoothing , 1995 .

[23]  M. Wand,et al.  Accuracy of Binned Kernel Functional Approximations , 1995 .