Paper Information - Kera: A Unified Storage and Ingestion Architecture for Efficient Stream Processing

Kera: A Unified Storage and Ingestion Architecture for Efficient Stream Processing

Big Data applications are rapidly moving from batch-oriented execution to a real-time model in order to extract value from streams of data as fast as they arrive. Such stream-based applications need to immediately ingest and analyze data and, in many use cases, combine live (i.e., real-time streams) and archived data in order to extract better insights. Current streaming architectures are designed with distinct components for ingestion (e.g., Kafka) and storage (e.g., HDFS) of stream data. Unfortunately, this separation is becoming an overhead, especially when data needs to be archived for later analysis (i.e., near real-time): in such use cases, stream data has to be written twice to disk and may pass twice over high-latency networks. Moreover, current ingestion mechanisms offer no support for searching the acquired streams in real time, an important requirement for reacting promptly to fast data. In this paper we describe the design of Kera: a unified storage and ingestion architecture that could better serve the specific needs of stream processing. We identify a set of design principles for stream-based Big Data processing that guide us in designing a novel architecture for streaming. We design Kera to significantly reduce storage and network utilization, which can lead to reduced times for stream processing and archival. To this end, we propose a set of optimization techniques for handling streams with a log-structured (in memory and on disk) approach. On top of our envisioned architecture we devise the implementation of an efficient interface for data ingestion, processing, and storage (DIPS), an interplay between processing engines and smart storage systems, with the goal of reducing the end-to-end stream processing latency.

María S. Pérez-Hernández | Gabriel Antoniu | Alexandru Costan | Ovidiu-Cristian Marcu
