The technologies developed to address the needs of Big Data have presented a vast number of beneficial opportunities for use alongside the traditional Data Warehouse (DW). Several use cases have been proposed for Apache Hadoop as a Big Data platform that complements the traditional DW. One of these use cases is the offloading of "dormant data," that is, infrequently used or inactive data, from the DW relational database management system (RDBMS) to the Hadoop Distributed File System (HDFS) for long-term archiving in an active, query-able state. The primary goal of this work is to explore and define the process by which Extract-Transform-Load (ETL) workflows can be generated with applicable tools to achieve such a data offloading solution. Additional focuses of this project include conducting experiments to measure DW query performance before and after the Hadoop archive implementation and analyzing the usability of the HDFS "active archive." This paper discusses the cost savings and performance gains this approach yields for the DW. The process defined in our research and experimentation suggests that fully automated ETL workflows for offloading and archiving data to Hadoop are feasible.
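As a rough sketch of the kind of offloading workflow the paper describes, the steps below show how dormant rows might be exported from the DW RDBMS into HDFS with Apache Sqoop and then registered as a query-able Hive external table. The connection string, table name, column layout, and date cutoff are hypothetical placeholders for illustration, not values taken from the paper's experiments.

    # Step 1: export rows older than a chosen cutoff from the DW fact table into HDFS.
    sqoop import \
      --connect jdbc:mysql://dw-host:3306/warehouse \
      --username etl_user -P \
      --table fact_sales \
      --where "sale_date < '2010-01-01'" \
      --target-dir /archive/fact_sales \
      --as-textfile \
      --fields-terminated-by '\t' \
      --num-mappers 4

    # Step 2: register the archived files as a Hive external table so they remain query-able.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS fact_sales_archive (
        sale_id   BIGINT,
        store_id  INT,
        sale_date STRING,
        amount    DOUBLE
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      LOCATION '/archive/fact_sales';
    "

Once the archive table has been verified, the corresponding rows could be purged from the RDBMS fact table, which is where the query-performance and cost benefits discussed in the paper would be realized.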