A Universal Source Coding Perspective on PPM
The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have been shown to be very efficient in practice. This thesis treats PPM algorithms from an information-theoretic point of view, based on results and methods from universal source coding theory. A number of source classes for modeling text-like data are developed, distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values. Two subproblems of PPM are studied in detail: the problem of sequential multi-alphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side information in the form of observations from a source with similar parameters. For the multi-alphabet coding problem, several codes are presented, together with a sufficient condition for asymptotic optimality, which is satisfied by some of the introduced codes. Furthermore, the natural law of succession, a known solution of the underlying estimation problem, is analysed from a coding perspective, and a method is given for calculating the coding probabilities of a known multi-alphabet code. A code using side information from a similar source is presented and analysed, and is proved to asymptotically outperform codes not using side information if the divergence between the two sources is small. Based on the above results, modified PPM algorithms are proposed, and methods are developed for estimating code parameters during encoding. It is experimentally verified that the introduced modifications improve the compression rate of PPM on two standard sets of test data. Several known PPM versions are shown to be characterizable in terms of the introduced source classes. The combination of PPM and MDL (minimum description length) model class estimation is investigated.
Codes using this combination of estimation techniques are given for the different source structures introduced for text-like data, and the modified PPM is shown experimentally to yield an improvement when combined with MDL estimation as well.
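As a rough illustration of what sequential coding with an escape mechanism involves, the sketch below computes the ideal code length of a sequence under a simple escape-based estimator in the style of the well-known PPM method C. It is not one of the codes developed in the thesis. The function names `escape_probability` and `ideal_code_length` are hypothetical, and the sketch assumes a known finite alphabet size over which unseen symbols are coded uniformly, whereas the thesis considers sources with unknown alphabet.

```python
import math
from collections import Counter


def escape_probability(symbol, counts, alphabet_size):
    """Coding probability of `symbol` given the counts seen so far.

    Assumption: a method-C-style escape estimate. With n symbols observed,
    q of them distinct, a seen symbol i gets n_i / (n + q) and the escape
    event gets q / (n + q), split uniformly over the unseen symbols.
    """
    n = sum(counts.values())
    q = len(counts)
    unseen = alphabet_size - q
    if n == 0:
        # Nothing observed yet: code the first symbol uniformly.
        return 1.0 / alphabet_size
    if unseen == 0:
        # Every symbol has been seen, so no escape mass is needed.
        return counts[symbol] / n
    if symbol in counts:
        return counts[symbol] / (n + q)
    return (q / (n + q)) / unseen


def ideal_code_length(sequence, alphabet_size=256):
    """Sum of -log2 of the sequential coding probabilities, in bits."""
    counts = Counter()
    bits = 0.0
    for symbol in sequence:
        bits -= math.log2(escape_probability(symbol, counts, alphabet_size))
        counts[symbol] += 1  # update the statistics after coding the symbol
    return bits
```

An arithmetic coder driven by these probabilities would come within a few bits of `ideal_code_length`, which is why such code lengths are the usual object of analysis. A full PPM coder would additionally condition the counts on the preceding context.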