Big data storage technologies: a survey

Industry is pushing hard to develop more feasible and viable tools for storing data that is growing rapidly in volume, velocity, and diversity, termed ‘big data’. The structural shift of storage mechanisms from traditional data management systems to NoSQL technology is driven by the need to meet big data storage requirements. However, the available big data storage technologies fall short of providing consistent, scalable, and available solutions for continuously growing heterogeneous data. Storage is the preliminary step of big data analytics for real-world applications such as scientific experiments, healthcare, social networks, and e-business. So far, Amazon, Google, and Apache are among the de facto industry standards for big data storage solutions, yet the literature does not report an in-depth survey of the storage technologies available for big data that investigates their performance and magnitude gains. The primary objective of this paper is to conduct a comprehensive investigation of state-of-the-art storage technologies available for big data. We present a well-defined taxonomy of big data storage technologies to help data analysts and researchers understand and select the storage mechanism that best fits their needs. To evaluate the performance of different storage architectures, we compare and analyze the existing approaches using Brewer’s CAP theorem. We also discuss the significance and applications of these storage technologies and their support for other categories. Finally, we highlight several future research challenges, with the aim of expediting the deployment of reliable and scalable storage systems.
