ETL WORKFLOW GENERATION FOR OFFLOADING DORMANT DATA FROM THE DATA WAREHOUSE TO HADOOP

The technologies developed to address the needs of Big Data present a wealth of opportunities for use alongside the traditional Data Warehouse (DW). Several use cases have been proposed for Apache Hadoop as a Big Data platform complementing traditional DWs. One such use case is the offloading of "dormant data," that is, infrequently used or inactive data, from the DW relational database management system (RDBMS) to the Hadoop Distributed File System (HDFS) for long-term archiving in an active, queryable state. The primary goal of this work is to explore and define the process by which Extract-Transform-Load (ETL) workflows can be generated using applicable tools to achieve such a data offloading solution. Additional focuses of this project include conducting experiments to measure DW query performance before and after the Hadoop archive implementation and providing analysis of the usability of the HDFS "active archive." This paper discusses the cost savings and DW performance gains afforded by this approach. The process defined in our research and experimentation suggests that the development of fully automated ETL workflows for offloading and archiving data to Hadoop is feasible.
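To make the offloading step concrete, the sketch below shows one way a generated workflow step of this kind could look, using Apache Sqoop (one applicable tool for RDBMS-to-HDFS transfer) to extract rows past a dormancy threshold into an HDFS archive directory. This is a minimal illustrative sketch, not the paper's implementation; the JDBC connection string, the sales_fact table, the order_date column, and the HDFS target path are all hypothetical placeholders.

```python
# A minimal sketch of a generated dormant-data offload step.
# All specific names (JDBC URL, table, column, paths) are hypothetical.
import subprocess
from datetime import date, timedelta


def build_offload_command(table: str, date_column: str, dormant_after_days: int) -> list[str]:
    """Build a Sqoop import command that copies rows older than the
    dormancy cutoff from the DW RDBMS into HDFS."""
    cutoff = date.today() - timedelta(days=dormant_after_days)
    return [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@dw-host:1521/DWDB",  # hypothetical DW connection
        "--username", "etl_user",  # credentials would come from a secure store in practice
        "--table", table,
        # Only rows past the dormancy threshold are extracted.
        "--where", f"{date_column} < DATE '{cutoff.isoformat()}'",
        # Land the archive in HDFS where Hive or Impala can query it.
        "--target-dir", f"/archive/{table}",
        "--as-parquetfile",
    ]


if __name__ == "__main__":
    # Offload rows older than three years from a hypothetical fact table.
    cmd = build_offload_command("sales_fact", "order_date", dormant_after_days=365 * 3)
    subprocess.run(cmd, check=True)  # a matching DELETE on the RDBMS would follow
```

In a fully automated workflow, a step like this would be emitted per table from metadata describing each table's dormancy criteria, followed by a verification step and the corresponding purge of the offloaded rows from the RDBMS.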