Model-Based Semantic Compression for Network-Data Tables

While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose Model-Based Semantic Compression (MBSC), a novel data compression framework that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. We describe the architecture and algorithms underlying SPARTAN, a model-based semantic compression system that exploits predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. Our experimentation with several real-life data sets has offered convincing evidence of the effectiveness of SPARTAN's model-based approach -- SPARTAN is able to consistently yield substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while utilizing only small data samples for model inference. Several promising directions for future research and possible applications of MBSC in the context of network management are identified and discussed.
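The core idea of model-based semantic compression, replacing a whole column with a small predictive model plus the rows the model gets wrong, can be sketched as follows. This is a minimal illustration and not SPARTAN's actual algorithm: it assumes a single numeric predictor column, a one-split regression tree (a stump) in place of a full CaRT, and an absolute error tolerance. The names `fit_stump`, `compress_column`, and `decompress_column` are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree: returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[1:]:  # every split that leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct predictor value: predict the overall mean
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict(model, x):
    threshold, left_value, right_value = model
    return left_value if x < threshold else right_value


def compress_column(xs, ys, tolerance):
    """Replace column ys by a model over xs plus explicitly stored outliers.

    A row is an outlier when the model's prediction misses the true value
    by more than the prescribed tolerance for this attribute.
    """
    model = fit_stump(xs, ys)
    outliers = {i: y for i, (x, y) in enumerate(zip(xs, ys))
                if abs(predict(model, x) - y) > tolerance}
    return model, outliers


def decompress_column(xs, model, outliers):
    """Rebuild the column: stored outliers verbatim, model predictions elsewhere."""
    return [outliers.get(i, predict(model, x)) for i, x in enumerate(xs)]


if __name__ == "__main__":
    xs = [1, 2, 3, 10, 11, 12]                  # predictor column (kept as-is)
    ys = [5.0, 5.2, 4.9, 20.1, 19.8, 35.0]      # column to be compressed
    model, outliers = compress_column(xs, ys, tolerance=6.0)
    print(model, outliers)
    print(decompress_column(xs, model, outliers))
```

Every reconstructed value is either exact (an outlier) or within the tolerance of the original, which is the kind of per-attribute error guarantee the abstract describes. The compression gain comes from storing three numbers and a short outlier list instead of the full column.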
