Big Data in the Cloud: A Survey

Big Data has become a hot topic across several business areas that require the storage and processing of huge volumes of data. Cloud computing supports Big Data by providing high storage and processing capabilities, and it enables corporations to consume resources in a pay-as-you-go model, making clouds the optimal environment for storing and processing huge quantities of data. By using virtualized resources, the Cloud can scale very easily, be highly available, and provide massive storage capacity and processing power. This paper surveys existing database models for storing and processing Big Data within a Cloud environment. In particular, we detail the following traditional NoSQL databases: BigTable, Cassandra, DynamoDB, HBase, Hypertable, and MongoDB. We also analyze the MapReduce framework and its developments (Apache Spark, HaLoop, and Twister), as well as alternatives such as Apache Giraph, GraphLab, Pregel, and MapD, a novel platform that uses GPU processing to accelerate Big Data processing. Finally, we present two case studies that demonstrate the successful use of Big Data within Cloud environments, and we discuss the challenges that must be addressed in the future.
