A Survey of Large Scale Data Management Approaches in Cloud Environments

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data. Moreover, recent advances in Web technology have made it easy for any user to provide and consume content of any form. This has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. Cloud computing is associated with a new paradigm for the provision of computing infrastructure. This paradigm shifts the location of the infrastructure to the network in order to reduce the costs associated with the management of hardware and software resources. This paper gives a comprehensive survey of numerous approaches and mechanisms for deploying data-intensive applications in the cloud, which are gaining a lot of momentum in both the research and industrial communities. We analyze the various design decisions of each approach and its suitability to support certain classes of applications and end-users. We discuss some open issues and future challenges pertaining to scalability, consistency, and economical processing of large-scale data in the cloud. We also highlight the characteristics of the best candidate classes of applications that can be deployed in the cloud.
