Hardware Acceleration Landscape for Distributed Real-Time Analytics: Virtues and Limitations

We are witnessing a technological revolution with a broad impact ranging from daily life (e.g., personalized medicine and education) to industry (e.g., data-driven healthcare, commerce, agriculture, and mining). At the core of this transformation lies data. The transformation is facilitated by embedded devices, collectively known as the Internet of Things (IoT), which produce real-time feeds of sensor data that are collected and processed to build a dynamic physical model used for optimized real-time decision making. At the infrastructure level, supporting the IoT paradigm requires a scalable architecture for processing massive volumes of present and historical data at an unprecedented velocity. To cope with such extreme scale, we argue for the need to revisit the hardware and software co-design landscape in light of two key technological advancements. The first is the virtualization of computation and storage over highly distributed data centers spanning continents. The second is the emergence of a variety of specialized hardware accelerators that complement traditional general-purpose processors. Further efforts are required to unify these two trends in order to harness the power of big data. In this paper, we present a formulation and characterization of the hardware acceleration landscape geared towards real-time analytics in the cloud. Our goal is to assist both researchers and practitioners in navigating the newly revived field of software and hardware co-design for building next-generation distributed systems. We further present a case study that explores the interplay of software and hardware in the design of distributed real-time stream processing.
