Compressing Massive Geophysical Datasets Using Vector Quantization

This article presents a procedure for compressing massive geophysical datasets. A dataset is stratified geographically, and a penalized clustering algorithm is applied to each stratum independently. The algorithm, called Monte Carlo extended ECVQ, is based on the entropy-constrained vector quantizer design algorithm (ECVQ). ECVQ trades off the error induced by compression against the degree of data reduction to produce a set of representative points, each of which stands for some number of input observations. Because the data are massive, a preliminary set of representatives is determined from a sample of the stratum; the full stratum is then clustered by assigning each observation to its nearest representative. After the initial representatives are replaced by the means of these clusters, the new representatives and their associated counts form a compressed version, or summary, of the original stratum data. Because the initial set of representatives is determined from a sample, the final summary is subject to sampling variation. A statistical model for the relationship between compressed and uncompressed data provides a framework for assessing this variability. Test data from the International Satellite Cloud Climatology Project are used to demonstrate the procedure.
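The two-stage procedure sketched above can be illustrated in code. The following is a minimal sketch, not the paper's actual Monte Carlo extended ECVQ implementation: the function name, the penalty weight `lam`, the codebook size `k`, and the iteration scheme are all assumptions made here for illustration. Stage 1 fits representatives on a sample using an entropy-penalized assignment rule (squared error plus a penalty proportional to each representative's codeword length, -log2 of its empirical probability); stage 2 assigns every observation in the full stratum to its nearest representative and replaces each representative by its cluster mean, returning the representatives and their counts as the summary.

```python
import numpy as np

def ecvq_summarize(data, sample_size=1000, k=8, lam=0.1, n_iter=20, seed=0):
    """Illustrative two-stage summary (hypothetical sketch, not the
    published algorithm): entropy-penalized clustering on a sample,
    then full-stratum assignment and cluster-mean replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
    sample = data[idx]

    # Stage 1: entropy-constrained clustering on the sample.
    centers = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    probs = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Penalized cost: squared error plus lam * codeword length.
        cost = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(-1) \
               - lam * np.log2(probs)[None, :]
        labels = cost.argmin(1)
        for j in range(k):
            members = sample[labels == j]
            if len(members):
                centers[j] = members.mean(0)
        counts = np.bincount(labels, minlength=k)
        counts = np.maximum(counts, 1)       # keep probabilities positive
        probs = counts / counts.sum()

    # Stage 2: assign the full stratum to the nearest representative,
    # then replace each representative by the mean of its cluster.
    dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dist.argmin(1)
    counts = np.bincount(labels, minlength=k)
    reps = np.array([data[labels == j].mean(0) if counts[j] else centers[j]
                     for j in range(k)])
    return reps, counts
```

The returned `(reps, counts)` pair is the compressed summary: each representative stands for `counts[j]` original observations, so the counts always sum to the stratum size.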