Platform Agnostic Streaming Data Application Performance Models

The mapping of computational needs onto execution resources is, by and large, a manual task, and users are frequently guided simply by intuition and past experience. We present a queueing-theory-based performance model for streaming data applications that takes steps towards a better understanding of resource mapping decisions, thereby helping application developers make good mapping choices. The performance model (and its associated cost model) is agnostic to the specific properties of the compute resource and application, characterizing them simply by their achievable data throughput. We illustrate the model with a pair of applications, one drawn from computational biology and the other a classic machine learning problem.
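
The abstract does not state the model's equations, so the following is only a minimal sketch of what characterizing stages purely by achievable throughput can look like. It assumes a linear pipeline whose sustained rate is bounded by its slowest stage, and it uses the textbook M/M/1 queue as a stand-in for the queueing analysis; the function names and the example rates are hypothetical and are not taken from the paper.

```python
import math


def pipeline_throughput(stage_rates):
    """Upper bound on the sustained throughput of a linear pipeline.

    Each stage is described only by its achievable throughput
    (items per second); the slowest stage is the bottleneck.
    """
    if not stage_rates:
        raise ValueError("at least one stage is required")
    if any(rate <= 0 for rate in stage_rates):
        raise ValueError("stage rates must be positive")
    return min(stage_rates)


def utilization(arrival_rate, service_rate):
    """Fraction of time a stage is busy (rho = lambda / mu)."""
    return arrival_rate / service_rate


def mm1_mean_time_in_system(arrival_rate, service_rate):
    """Mean time an item spends queued plus in service at an M/M/1 stage.

    Assumption: Poisson arrivals and exponential service times.
    Returns infinity when the stage cannot keep up with the arrivals.
    """
    if arrival_rate >= service_rate:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    # Hypothetical three-stage mapping, rates in items per second.
    rates = [4.0, 2.0, 8.0]
    bound = pipeline_throughput(rates)
    print("throughput bound:", bound)
    # Drive the pipeline at half of its bottleneck rate.
    offered = 0.5 * bound
    for mu in rates:
        print(mu, utilization(offered, mu), mm1_mean_time_in_system(offered, mu))
```

Because each stage enters only through its rate, the same functions apply whether a stage is mapped to a CPU, a GPU, or an FPGA, which is the sense in which such a model is platform agnostic.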
