Sublinear Algorithms for Approximating String Compressibility

We raise the question of approximating, in sublinear time, the compressibility of a string with respect to a fixed compression scheme. We study this question in detail for two popular lossless compression schemes, run-length encoding (RLE) and Lempel-Ziv (LZ), and present sublinear algorithms for approximating compressibility with respect to both. We also give several lower bounds showing that our algorithms for both schemes cannot be improved significantly. Our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the LZ compressibility of a string to the number of distinct short substrings it contains. In addition, we show that approximating compressibility with respect to LZ is closely related to approximating the support size of a distribution.
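For intuition, the sketch below illustrates in Python the quantities involved, though it is not the paper's algorithm or analysis: a sampling-based estimator of the number of runs, which governs the RLE cost and can be estimated to within an additive error using a number of character queries independent of the string length; an exact greedy LZ77 parse; and a count of distinct length-ell substrings, the quantity the structural lemmas relate to the LZ phrase count. All function names and the particular estimator are illustrative choices made here.

```python
# Illustrative sketches (not the algorithms from the paper): a sampling-based
# run-count estimator, a naive greedy LZ77 parse, and a distinct-substring count.
import random


def estimate_num_runs(s: str, num_samples: int, seed: int = 0) -> float:
    """Estimate the number of runs in s by sampling positions.

    Position i (1 <= i < len(s)) is a run boundary iff s[i] != s[i-1], and the
    number of runs equals (number of boundaries) + 1. Sampling boundary
    positions uniformly gives an additive-error estimate whose query count
    does not depend on len(s).
    """
    n = len(s)
    if n <= 1:
        return float(n)
    rng = random.Random(seed)
    boundaries = 0
    for _ in range(num_samples):
        i = rng.randrange(1, n)          # inspect two adjacent characters
        if s[i] != s[i - 1]:
            boundaries += 1
    return boundaries / num_samples * (n - 1) + 1


def lz77_phrase_count(s: str) -> int:
    """Number of phrases in a greedy LZ77 parse (naive, roughly O(n^3)).

    At position i the next phrase is the longest s[i:i+l] that also occurs
    starting at some earlier position (overlap with the phrase itself is
    allowed); if s[i] has not occurred before, the phrase is that single
    character.
    """
    i, phrases, n = 0, 0, len(s)
    while i < n:
        l = 1
        # extend while s[i:i+l+1] still occurs starting before position i
        while i + l < n and s[i:i + l + 1] in s[:i + l]:
            l += 1
        phrases += 1
        i += l
    return phrases


def distinct_substrings(s: str, ell: int) -> int:
    """Number of distinct substrings of length ell appearing in s."""
    return len({s[i:i + ell] for i in range(len(s) - ell + 1)})


if __name__ == "__main__":
    s = "abab" * 200 + "a" * 150 + "abc" * 50
    print("length         :", len(s))
    print("runs (sampled) :", round(estimate_num_runs(s, 2000), 1))
    print("runs (exact)   :", 1 + sum(s[i] != s[i - 1] for i in range(1, len(s))))
    print("LZ77 phrases   :", lz77_phrase_count(s))
    for ell in (2, 4, 8):
        print(f"distinct substrings of length {ell}:", distinct_substrings(s, ell))
```

Each call to estimate_num_runs reads only num_samples pairs of adjacent characters, which is the sense in which such estimators are sublinear; the LZ77 parse and the distinct-substring count are computed exactly here only to make the two sides of the structural relation concrete on small inputs.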
