Parallel Secondo: Boosting Database Engines with Hadoop

Hadoop is a simple and efficient parallel framework based on the MapReduce paradigm, and it has made parallel processing a central topic in data-intensive applications. Since Hadoop can be deployed easily on large-scale clusters of up to thousands of computers, various studies attempt to run common relational database operations on this new platform as well, expecting remarkable performance. However, these works must prepare customized programs for each input format, which complicates communication between collaborators. In addition, all intermediate data must be transformed into key-value pairs and transferred through the underlying HDFS, so that Map and Reduce tasks can process them and the workload stays balanced across the cluster. The overhead incurred in this step reduces both the speed-up and the scale-up of these systems. This paper therefore proposes a lightweight and efficient coupling structure that combines Hadoop with single-computer databases at the engine level. On the one hand, a carefully designed parallel data model allows end-users to express parallel queries like ordinary queries; all current and future data types and algorithms can be used directly, with no need to adapt them to the parallel platform. On the other hand, a simple and independent distributed file system transfers data among database engines directly, bypassing HDFS and thereby removing as much unnecessary transformation and transfer overhead as possible. As a demonstration, this paper introduces the prototype Parallel Secondo. It has been fully evaluated on both small- and large-scale clusters, achieving satisfactory performance for a range of database operations.
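The transformation overhead described above can be illustrated with a minimal sketch (not taken from the paper; all names are hypothetical): even a plain relational equi-join, once expressed in the MapReduce model, requires every tuple of both relations to be wrapped into a (key, value) pair before it can be shuffled and reduced.

```python
# Hedged sketch: a relational equi-join re-expressed as key-value
# transformations, simulating the MapReduce data flow in-process.
# It shows why every tuple must be re-keyed before the shuffle --
# the per-tuple transformation overhead the abstract refers to.
from collections import defaultdict

def map_phase(relation_name, tuples, key_index):
    # Emit (join key, (source relation, tuple)) pairs for every tuple.
    for t in tuples:
        yield t[key_index], (relation_name, t)

def reduce_phase(grouped):
    # For each join key, combine tuples from both input relations.
    for key, pairs in grouped.items():
        left = [t for name, t in pairs if name == "R"]
        right = [t for name, t in pairs if name == "S"]
        for l in left:
            for r in right:
                yield l + r

def mapreduce_join(R, S):
    # Simulate the shuffle: group all emitted pairs by key.
    grouped = defaultdict(list)
    for k, v in list(map_phase("R", R, 0)) + list(map_phase("S", S, 0)):
        grouped[k].append(v)
    return list(reduce_phase(grouped))

R = [(1, "alice"), (2, "bob")]
S = [(1, "math"), (1, "physics"), (3, "art")]
# Joining on the first attribute matches only key 1,
# yet all five tuples were re-keyed and shuffled.
```

In a real Hadoop job the grouped pairs would additionally be serialized to HDFS between the two phases, which is exactly the transfer step Parallel Secondo's own distributed file system is designed to avoid.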
