A Variable-Grid Algorithm for Smoothing Clustered Data

SUMMARY An algorithm is described for smoothing data points on the plane. The observation region is divided into cells, and each group of four adjoining cells is tested to see whether the values in the cells can be accepted as coming from a distribution with the same underlying parameter values. If so, the cells are grouped again in a further round of smoothing. If not, the original values are retained and no further smoothing involving those cells takes place. The method is applied to two situations: count data assumed to follow a nonhomogeneous Poisson process, and count data assumed to follow a negative binomial process with nonhomogeneous rate.

This paper had its origins in the attempt to define appropriate smoothing procedures for counts of clustered data in two dimensions, where it could be assumed that both some large-scale spatial variation in the count rate and local variation due to clustering were present. There is of course no method of finally distinguishing between these two types of spatial inhomogeneity purely on the basis of the data. It is perfectly logically consistent to treat a set of points either as a sample from a Poisson process with rapidly varying density function, or as a sample from a spatially homogeneous clustering process (Bartlett, 1964). Any method for discriminating between the two hypotheses must therefore be based, whether explicitly or otherwise, on some prior assumptions concerning the likely extent of clustering on the one hand or of departures from spatial homogeneity on the other.

This paper describes a straightforward and computationally rapid algorithm that will allow the user to obtain an impression of the underlying spatial pattern under simple assumptions concerning the nature and extent of local clustering.

At the first step in the algorithm, the user nominates a cell size to indicate the dimensions of local spatial clustering.
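
The grouping procedure outlined in the summary can be sketched in code for the Poisson case. What follows is a minimal illustration, not the paper's implementation. It assumes a square grid of counts whose side is a power of 2, it reports the smoothed value as the mean count per original cell, and it uses a Poisson dispersion (chi-square) test at the 5% level as the acceptance rule; that choice of test is an assumption, since the test has not been specified at this point in the text. The names `homogeneous` and `smooth_counts` are hypothetical.

```python
def homogeneous(values, critical=7.815):
    """Assumed acceptance rule: Poisson dispersion test on four counts.

    The default 7.815 is the upper 5% point of chi-square on 3 degrees
    of freedom.
    """
    mean = sum(values) / len(values)
    if mean == 0:
        return True  # four empty cells are trivially consistent
    statistic = sum((v - mean) ** 2 for v in values) / mean
    return statistic <= critical


def smooth_counts(counts, critical=7.815):
    """Return a grid of smoothed mean counts per original cell."""
    n = len(counts)
    if n == 0 or n & (n - 1) or any(len(row) != n for row in counts):
        raise ValueError("expected a square grid whose side is a power of 2")

    smoothed = [[float(c) for c in row] for row in counts]
    totals = [list(row) for row in counts]   # counts at the current level
    active = [[True] * n for _ in range(n)]  # cells still eligible to merge
    size = 1                                 # current cell side, in original cells

    while len(totals) > 1:
        m = len(totals) // 2
        new_totals = [[0] * m for _ in range(m)]
        new_active = [[False] * m for _ in range(m)]
        for bi in range(m):
            for bj in range(m):
                group = [(2 * bi + di, 2 * bj + dj)
                         for di in (0, 1) for dj in (0, 1)]
                values = [totals[i][j] for i, j in group]
                new_totals[bi][bj] = sum(values)
                # Merge only if all four cells were themselves accepted
                # in the previous round and the test does not reject.
                if all(active[i][j] for i, j in group) and homogeneous(values, critical):
                    new_active[bi][bj] = True
                    side = 2 * size
                    mean = sum(values) / (side * side)
                    for i in range(bi * side, (bi + 1) * side):
                        for j in range(bj * side, (bj + 1) * side):
                            smoothed[i][j] = mean
                # Otherwise the values already in `smoothed` are retained.
        totals, active = new_totals, new_active
        size *= 2
    return smoothed
```

For instance, on a 4 x 4 grid whose top-left group holds 50, 0, 0, 0 and whose other cells all hold 5, the top-left values are left unchanged, the three remaining groups are each accepted, and no second-round merge takes place because one of the four coarser cells was rejected. Apart from the counts and the critical value, the only quantity the user supplies in this sketch is the starting cell size.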
This should be as small as possible while still consistent with the assumption that observations in disjoint cells may be regarded as independently generated. The choice is not very critical, but if the cell size is chosen too small, the procedure will identify as spatial heterogeneity features that the user might prefer to regard as clusters. The observation region, supposed to be rectangular, is then divided into cells, the number of divisions along each dimension being equal to some power of 2. Typically, 64 × 64 or 32 × 32 divisions form convenient starting points. For convenience of plotting, the output