CHAC: An Effective Attribute Clustering Algorithm for Large-Scale Data Processing

Hadoop has become a leading architecture for large-scale data processing. One efficient way to accelerate data processing is column-oriented storage, which has recently been integrated into the Hadoop family. However, designing an attribute clustering algorithm that achieves optimal data processing performance on a column-oriented Hadoop system remains an open problem. In this paper, we propose a novel algorithm called CHAC to address it. CHAC handles both overlapping and non-overlapping attribute clusters. In addition, an adjustable parameter limits space overhead and thereby prevents excessive attribute redundancy. Experimental results on the TPC-H benchmark demonstrate the efficiency and effectiveness of the proposed algorithm.
