Storage Size Estimation for Schemaless Big Data Applications: A JSON-based Overview

Numerous technologies have been proposed for storing big data on Cloud platforms. However, the choice among these technologies is always application specific. Selecting a suitable data model is difficult, so architects and designers must review the requirements before choosing a solution. This paper presents 14 data models available in the market to choose from. In addition, more than 45 database solutions are available in the market, most of which can be categorized into one of these data models, each applicable to its own set of use cases (a few products could not be categorized into any of the 14 data models). We find that when schemaless information is stored, the size of the data in the database is larger than its original size. Metadata and the physical schema are the two factors responsible for this additional storage requirement. Mathematical models and experimental evaluations show that MongoDB requires many times more storage space than the original size of the data. We propose a storage space estimation equation for JSON-based solutions that expresses the storage requirement relative to the space required by CSV as a baseline. It may be used to approximate the storage space an application will need before storage is purchased in a Cloud environment.
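The effect of metadata on storage size can be illustrated with a minimal sketch. It is not the estimation equation proposed in the paper, and it does not model MongoDB's BSON encoding or physical schema; it only compares the byte size of the same records serialized as CSV (field names stored once, in the header) and as JSON (field names repeated in every record). The function names and the sample records are hypothetical and chosen purely for illustration.

```python
import csv
import io
import json


def csv_size(records, fields):
    """Bytes needed to store the records as CSV: one header row plus one row per record."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        writer.writerow([record[f] for f in fields])
    return len(buf.getvalue().encode("utf-8"))


def json_size(records):
    """Bytes needed to store the records as compact JSON, one document per line."""
    return sum(
        len(json.dumps(record, separators=(",", ":")).encode("utf-8")) + 1  # +1 for newline
        for record in records
    )


def overhead_ratio(records, fields):
    """JSON size expressed as a multiple of the CSV baseline."""
    return json_size(records) / csv_size(records, fields)


# Hypothetical sample data (an assumption, not taken from the paper).
fields = ["id", "name", "city"]
records = [{"id": i, "name": f"user{i}", "city": "Pune"} for i in range(1000)]

ratio = overhead_ratio(records, fields)
print(f"CSV bytes:  {csv_size(records, fields)}")
print(f"JSON bytes: {json_size(records)}")
print(f"JSON / CSV: {ratio:.2f}")
```

Because every JSON document carries its own field names, the ratio grows as field names get longer relative to the values they label; a real database adds further per-document and per-collection overhead on top of this.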
