Survey of real-time processing systems for big data

In recent years, real-time processing and analytics systems for big data, in the context of Business Intelligence (BI), have received growing attention. Traditional BI platforms that perform regular updates on a daily, weekly, or monthly basis are no longer adequate for fast-changing business environments. However, due to the nature of big data, achieving real-time capability with traditional technologies has become a challenge. The recent distributed computing technology MapReduce provides off-the-shelf high scalability that can significantly shorten the processing time for big data. Its open-source implementation, Hadoop, has become the de facto standard for processing big data; however, Hadoop is limited in its support for real-time updates. Improvements to Hadoop for real-time capability, as well as alternative real-time frameworks, have been emerging in recent years. This paper presents a survey of the open-source technologies that support big data processing in a real-time or near-real-time fashion, including their system architectures and platforms.
