Sublinear Algorithms for Approximating String Compressibility

We raise the question of approximating, in sublinear time, the compressibility of a string with respect to a fixed compression scheme. We study this question in detail for two popular lossless compression schemes, run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both. We also give several lower bounds showing that our algorithms for both schemes cannot be improved significantly.

Our investigation of LZ77 yields results whose interest goes beyond the questions we initially set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings it contains (its ℓth subword complexity, for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.
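To make the two quantities in the abstract concrete, the following is a minimal illustrative sketch, not the algorithms of the paper. It assumes a simplified cost model in which RLE compressibility is measured only by the number of runs (ignoring the bits needed to write each run length), and it estimates that count by sampling positions instead of reading the whole string. The function names `count_runs`, `estimate_runs`, and `subword_complexity` are hypothetical and chosen for illustration.

```python
import random


def count_runs(w):
    """Exact number of maximal runs of identical symbols in w (reads all of w)."""
    if not w:
        return 0
    return 1 + sum(1 for i in range(1, len(w)) if w[i] != w[i - 1])


def estimate_runs(w, samples, rng):
    """Estimate the run count from `samples` uniformly random positions.

    Assumption: simplified model, runs only. A position i >= 1 starts a new
    run exactly when w[i] != w[i - 1], so the fraction of sampled positions
    that start a run, scaled by (n - 1), estimates the number of run
    boundaries. Each sample reads two adjacent symbols.
    """
    n = len(w)
    if n <= 1:
        return float(n)
    hits = 0
    for _ in range(samples):
        i = rng.randrange(1, n)  # uniform over positions 1 .. n-1
        if w[i] != w[i - 1]:
            hits += 1
    return 1 + hits * (n - 1) / samples


def subword_complexity(w, ell):
    """Number of distinct substrings of length ell in w (exact, linear scan)."""
    return len({w[i:i + ell] for i in range(len(w) - ell + 1)})


if __name__ == "__main__":
    rng = random.Random(0)
    w = "a" * 500 + "b" * 300 + "ab" * 100
    print("exact runs:    ", count_runs(w))
    print("estimated runs:", estimate_runs(w, 200, rng))
    print("distinct 2-substrings:", subword_complexity(w, 2))
```

The sampling estimator reads only `2 * samples` symbols, independent of the string length, and by a standard concentration argument its error is additive, on the order of the string length divided by the square root of the number of samples. `subword_complexity` is shown as an exact computation only to illustrate the definition; it reads the entire string and is not sublinear.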
