FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures

Accelerating SQL queries in stream processing with heterogeneous coprocessors, such as GPUs, has been shown to be an effective approach. Most prior work reports that heterogeneous coprocessors bring significant performance improvements because of their high parallelism and computation capacity. However, discrete memory architectures, with their relatively low PCI-e bandwidth and high latency, have eroded these benefits. Recently, hardware vendors have proposed CPU-GPU integrated architectures that place the CPU and GPU on the same chip. This integration provides new opportunities for fine-grained cooperation between the CPU and GPU when optimizing SQL queries in stream processing. In this paper, we propose a data stream system, called FineStream, for efficient window-based stream processing on integrated architectures. In particular, FineStream performs fine-grained workload scheduling between the CPU and GPU to take advantage of both architectures, and it also provides an efficient mechanism for handling dynamic stream queries. Our experimental results show that 1) on integrated architectures, FineStream achieves an average 52% throughput improvement and 36% lower latency over the state-of-the-art stream processing engine; and 2) compared to the stream processing engine on the discrete architecture, FineStream on the integrated architecture achieves a 10.4x price-throughput ratio and 1.8x energy efficiency, along with lower latency.
