A platform for big data analytics on distributed scale-out storage system

Big data analytics is the process of examining large volumes of data of many types to uncover hidden patterns, unknown correlations, and other useful information. Hadoop-based platforms have emerged to deal with big data. In Hadoop, the NameNode stores metadata in the memory of a single system, which becomes a performance bottleneck for scale-out. The Gluster file system has no metadata-related performance bottleneck. This paper proposes a big data platform designed to achieve high performance, scalability, and fault tolerance for big data analytics. The proposed platform consists of a big data storage component and a big data processing component. Both the Hadoop platform and the proposed platform are implemented on clusters of commodity Linux virtual machines, and their performance is evaluated. According to the evaluation, the proposed platform provides better scalability, better fault tolerance, and faster query response times than the Hadoop platform.
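The contrast between a single in-memory metadata service and a design without one can be illustrated with a small sketch. It is not the implementation of either Hadoop or Gluster: the class `CentralMetadataServer`, the function `hash_locate`, and the node names are hypothetical, and the plain modulo hashing used here is a simplification standing in for hash-based file placement in general.

```python
import hashlib


class CentralMetadataServer:
    """Toy stand-in for a design where one process holds all file metadata.

    Hypothetical illustration only; this is not the Hadoop NameNode API.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.table = {}    # path -> node; grows with the number of files
        self.lookups = 0   # every client lookup passes through this object

    def create(self, path):
        # Round-robin placement, recorded in the single in-memory table.
        node = self.nodes[len(self.table) % len(self.nodes)]
        self.table[path] = node
        return node

    def locate(self, path):
        self.lookups += 1
        return self.table[path]


def hash_locate(path, nodes):
    """Compute a file's node from its path alone, with no shared table.

    Simplified assumption: plain modulo hashing over a fixed node list.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    server = CentralMetadataServer(nodes)
    for i in range(6):
        server.create(f"/data/file{i}.csv")

    # Centralized: the answer comes from the server's table.
    print(server.locate("/data/file4.csv"), "table size:", len(server.table))

    # Hash-based: any client can compute the location independently.
    print(hash_locate("/data/file4.csv", nodes))
```

In the first design, the table size and the lookup load both concentrate on one machine as the number of files and clients grows. In the second, each client derives the location locally, so there is no single metadata holder to saturate.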
