Kera: A Unified Storage and Ingestion Architecture for Efficient Stream Processing
Big Data applications are rapidly moving from a batch-oriented execution model
to a real-time one in order to extract value from streams of data as fast as
they arrive. Such stream-based applications need to ingest and analyze data
immediately and, in many use cases, combine live data (i.e., real-time streams)
with archived data in order to extract better insights. Current streaming
architectures are designed with distinct components for ingestion (e.g.,
Kafka) and storage (e.g., HDFS) of stream data. Unfortunately, this separation
becomes an overhead, especially when data needs to be archived for later
analysis (i.e., near real-time): in such use cases, stream data has to be
written twice to disk and may pass twice over high-latency networks. Moreover,
current ingestion mechanisms offer no support for searching the acquired
streams in real time, an important requirement to promptly react to fast data.
In this paper we describe the design of Kera: a unified storage and
ingestion architecture that could better serve the specific needs of stream
processing. We identify a set of design principles for stream-based Big Data
processing that guide us in designing a novel architecture for streaming. We
design Kera to significantly reduce storage and network utilization, which
can in turn reduce the time needed for stream processing and archival. To this
end, we propose a set of optimization techniques for handling streams with a
log-structured approach, both in memory and on disk. On top of our envisioned
architecture, we devise an efficient interface for data ingestion, processing,
and storage (DIPS), an interplay between processing engines and smart storage
systems, with the goal of reducing end-to-end stream processing latency.
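
The log-structured idea can be illustrated with a minimal Python sketch. It is not Kera's implementation: the class name `UnifiedStreamLog`, the `segment_limit` parameter, and the length-prefixed file layout are all assumptions made for illustration. It shows one append-only log, with an in-memory tail and an on-disk body, that serves both ingestion and later reads, so each record is written to disk only once.

```python
class UnifiedStreamLog:
    """Hypothetical append-only log: one write path for ingestion and archival."""

    def __init__(self, path, segment_limit=4):
        self.path = path                    # on-disk part of the log
        self.segment_limit = segment_limit  # records buffered before a flush (assumed knob)
        self.active = []                    # in-memory tail of the log
        self.flushed = 0                    # number of records already on disk
        open(path, "ab").close()            # make sure the log file exists

    def append(self, record: bytes) -> int:
        """Ingest one record and return its logical offset in the stream."""
        offset = self.flushed + len(self.active)
        self.active.append(record)
        if len(self.active) >= self.segment_limit:
            self.flush()
        return offset

    def flush(self):
        """Write the in-memory tail to disk once, as length-prefixed records."""
        with open(self.path, "ab") as f:
            for rec in self.active:
                f.write(len(rec).to_bytes(4, "big") + rec)
        self.flushed += len(self.active)
        self.active = []

    def read(self, start=0):
        """Yield (offset, record) pairs from `start`: disk first, then memory."""
        offset = 0
        with open(self.path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                rec = f.read(int.from_bytes(header, "big"))
                if offset >= start:
                    yield offset, rec
                offset += 1
        for rec in self.active:
            if offset >= start:
                yield offset, rec
            offset += 1
```

A consumer that wants only live data would call `read` with the latest offset it has seen, while an archival job would call `read(0)`; both go through the same log rather than through separate ingestion and storage systems.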