Distributed Parallel Architecture for "Big Data"

This paper extends "Distributed Parallel Architecture for Storing and Processing Large Datasets", presented at the WSEAS SEPADS’12 conference in Cambridge. The original version covered the benefits of using a distributed parallel architecture to store and process large datasets. This paper analyzes the problem of storing, processing, and retrieving meaningful insight from petabytes of data. It surveys current distributed and parallel data processing technologies and, based on them, proposes an architecture that can be used to solve the analyzed problem. This version places more emphasis on distributed file systems and on the ETL processes involved in a distributed environment.
