A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering

Marriott (1971, Biometrics 27, 501-514) used a heuristic argument to derive the criterion g²|W| for determining the number of groups in a data set when the clustering objective function is the within-group determinant |W|. An analogous argument is employed to derive a criterion for use with the within-group sum-of-squares objective function trace(W). The behaviour of both Marriott's criterion and the new criterion is investigated by Monte Carlo methods. For homogeneous data based on uniform and independent variables, the performance of the new criterion is close to expectation while Marriott's criterion shows much more extreme behaviour. For grouped data, the new criterion correctly identifies the number of groups in 85% of data sets under a wide range of conditions, while Marriott's criterion shows a success rate of less than 40%. The new criterion is illustrated on the well-known Iris data, and some cautionary comments are made about its use.
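
The two objective functions named above, and Marriott's criterion built on the first of them, can be sketched as follows. This is only an illustration: it assumes W is the pooled within-group sum-of-squares-and-products matrix of a given partition and g is the number of groups in that partition. The function names are hypothetical, and the new criterion is not reproduced here because its formula is not stated in this abstract.

```python
import numpy as np


def within_group_scatter(X, labels):
    """Pooled within-group sum-of-squares-and-products matrix W (assumed definition)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    p = X.shape[1]
    W = np.zeros((p, p))
    for k in np.unique(labels):
        members = X[labels == k]
        # Deviations of each observation from its own group mean.
        D = members - members.mean(axis=0)
        W += D.T @ D
    return W


def trace_objective(X, labels):
    """Within-group sum of squares, trace(W), for a given partition."""
    return float(np.trace(within_group_scatter(X, labels)))


def marriott_criterion(X, labels):
    """Marriott's criterion g^2 |W| for a given partition into g groups."""
    g = np.unique(labels).size
    return float(g ** 2 * np.linalg.det(within_group_scatter(X, labels)))
```

In use, one would evaluate these quantities on the best partition found by a clustering procedure for each candidate number of groups and compare the values across g. How the partitions are obtained is outside the scope of this sketch.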