Paper information - Survey of real-time processing systems for big data

Survey of real-time processing systems for big data

In recent years, real-time processing and analytics systems for big data, in the context of Business Intelligence (BI), have received growing attention. Traditional BI platforms that perform regular updates on a daily, weekly, or monthly basis are no longer adequate for fast-changing business environments. However, due to the nature of big data, achieving real-time capability with traditional technologies has become a challenge. The recent distributed computing technology MapReduce provides off-the-shelf high scalability that can significantly shorten the processing time for big data. Its open-source implementation, Hadoop, has become the de facto standard for processing big data; however, Hadoop has limited support for real-time updates. Improvements to Hadoop's real-time capability, along with alternative real-time frameworks, have been emerging in recent years. This paper presents a survey of the open-source technologies that support big data processing in a real-time or near-real-time fashion, including their system architectures and platforms.
