An Efficient Approach for Storage of Big Data Streams in Distributed Stream Processing Systems

Besides centralized management, processing, and querying, storage is one of the important components of big data management. There is a constant need to store immense volumes of heterogeneous data in different formats. In big data stream processing applications, storage is given priority and plays a major role in historical data analysis. During stream processing, some of the incoming data and intermediate results are a good source of future samples. These samples can be used in future evaluation to eliminate many of the mistakes made in storing and maintaining big data streams. Hence, a big data stream application requires efficient storage support for historical queries. Researchers, scientists, and academicians are working to develop a sophisticated storage mechanism that keeps the most useful data for future reference by means of stream archive storage. However, a stream processing system cannot store the whole incoming stream for future reference. A technique is needed to get rid of expired data and free space for more incoming data in archive storage. Hence, keeping in view the storage space limitation, integration issues, and associated cost, we try to optimize the stream archive storage and free more space for future data. The proposed enhanced algorithm helps delete obsolete data (past retention or expired) and frees space for new incoming data on a distributed platform. Our paper presents an Enhanced Time Expired Algorithm (ETEA) for stream archive storage in a distributed environment that removes obsolete data based on time expiration and provides space for new incoming data for historical data analysis during skew time (hot spots). We also evaluated the efficiency of our algorithm using the skew factor. The experimental results show that our approach is 98% efficient and faster than other conventional techniques.
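
The general idea of time-based expiration in an archive store can be illustrated with a small sketch. This is not the ETEA algorithm itself, whose details are not given here; it is a toy, single-node, in-memory example that assumes records arrive in timestamp order. The class name `StreamArchive` and the parameters `retention_seconds` and `capacity` are hypothetical names chosen for illustration.

```python
from collections import deque


class StreamArchive:
    """Toy in-memory archive that drops records older than a retention window.

    Assumption: records are appended in non-decreasing timestamp order,
    so the oldest record is always at the left end of the deque.
    """

    def __init__(self, retention_seconds, capacity):
        self.retention_seconds = retention_seconds  # hypothetical retention window
        self.capacity = capacity                    # hypothetical max record count
        self._records = deque()                     # (timestamp, payload), oldest first

    def evict_expired(self, now):
        """Remove every record whose age exceeds the retention window."""
        evicted = 0
        while self._records and now - self._records[0][0] > self.retention_seconds:
            self._records.popleft()
            evicted += 1
        return evicted

    def append(self, timestamp, payload):
        """Evict expired records first, then store the new one if space allows."""
        self.evict_expired(timestamp)
        if len(self._records) >= self.capacity:
            return False  # archive is still full after eviction
        self._records.append((timestamp, payload))
        return True

    def __len__(self):
        return len(self._records)


archive = StreamArchive(retention_seconds=10, capacity=3)
results = [archive.append(t, f"event-{t}") for t in (0, 1, 2, 3, 12)]
```

In this run the record arriving at time 3 is rejected because the archive is full and nothing has expired yet, while the record arriving at time 12 is accepted because the two oldest records have passed the 10-second window and were evicted first. A distributed version would additionally need to decide which node holds which records and how skewed (hot spot) load is handled, which this sketch does not attempt.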
