LLVM-based automation of memory decoupling for OpenCL applications on FPGAs
Hamed Tabkhi | Samuel Rogers | Arnab A. Purkayastha | Suhas A. Shiddibhavi
[1] James C. Hoe,et al. Automatic multithreaded pipeline synthesis from transactional datapath specifications , 2010, Design Automation Conference.
[2] Santosh G. Abraham,et al. Effective stream-based and execution-based data prefetching , 2004, ICS '04.
[3] Mehdi Baradaran Tahoori,et al. Energy Efficient Scientific Computing on FPGAs using OpenCL , 2017, FPGA.
[4] Gunar Schirner,et al. DS-DSE: Domain-specific design space exploration for streaming applications , 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[5] David J. Lilja,et al. Data prefetch mechanisms , 2000, CSUR.
[6] David R. Kaeli,et al. Exploring the Efficiency of the OpenCL Pipe Semantic on an FPGA , 2016, SIGARCH Comput. Archit. News.
[7] Andrew C. Ling,et al. An OpenCL(TM) Deep Learning Accelerator on Arria 10 , 2017 .
[8] Tao Chen,et al. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9] Peter Y. K. Cheung,et al. Outer Loop Pipelining for Application Specific Datapaths in FPGAs , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[10] Zhiru Zhang,et al. ElasticFlow: A complexity-effective approach for pipelining irregular loop nests , 2015, 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[11] Hamed Tabkhi,et al. Locality Aware Memory Assignment and Tiling , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).
[12] Gunar Schirner,et al. Function-Level Processor (FLP): Raising efficiency by operating at function granularity for market-oriented MPSoC , 2014, 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors.
[13] Sotirios G. Ziavras,et al. Customized kernel execution on reconfigurable hardware for embedded applications , 2009, Microprocess. Microsystems.
[14] Alexander V. Veidenbaum,et al. Multiple stream tracker: a new hardware stride prefetcher , 2014, Conf. Computing Frontiers.
[15] Satoshi Matsuoka,et al. Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[16] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.
[17] Seth H. Pugsley,et al. Efficiently prefetching complex address patterns , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[18] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..
[19] Wei Zhang,et al. A performance analysis framework for optimizing OpenCL applications on FPGAs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[20] Wei Zhang,et al. FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[21] Pingfan Meng,et al. Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK , 2014, 2014 International Conference on Field-Programmable Technology (FPT).
[22] Jason Cong,et al. Bandwidth optimization through on-chip memory restructuring for HLS , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[23] David Bernstein,et al. Compiler techniques for data prefetching on the PowerPC , 1995, PACT.
[24] Robert J. Halstead,et al. Exploring irregular memory accesses on FPGAs , 2011, IA3 '11.
[25] Jianbin Fang,et al. A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.
[26] Shankar Balachandran,et al. Hardware prefetchers for emerging parallel applications , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[27] Vijay Janapa Reddi,et al. PIN: a binary instrumentation tool for computer architecture research and education , 2004, WCAE '04.
[28] Martin Margala,et al. High level programming of FPGAs for HPC and data centric applications , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).
[29] Hongbo Rong,et al. Single-dimension software pipelining for multi-dimensional loops , 2004 .
[30] Robert J. Halstead,et al. Compiled multithreaded data paths on FPGAs for dynamic workloads , 2013, 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).
[31] Ronald Tetzlaff,et al. A new high-speed real-time video processing platform , 2014, 2014 14th International Workshop on Cellular Nanoscale Networks and their Applications (CNNA).
[32] John Wawrzynek,et al. Architectural synthesis of computational pipelines with decoupled memory access , 2014, 2014 International Conference on Field-Programmable Technology (FPT).
[33] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[34] Henry M. Levy,et al. An architecture for software-controlled data prefetching , 1991, ISCA '91.
[35] Yu Wang,et al. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration , 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.
[36] Hamed Tabkhi,et al. Taxonomy of Spatial Parallelism on FPGAs for Massively Parallel Applications , 2018, 2018 31st IEEE International System-on-Chip Conference (SOCC).
[37] Jungwon Kim,et al. OpenACC to FPGA: A Framework for Directive-Based High-Performance Reconfigurable Computing , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[38] Jason Helge Anderson,et al. LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems , 2013, TECS.
[39] Michael J. Flynn,et al. Hardware and software cache prefetching techniques for MPEG benchmarks , 2000, IEEE Trans. Circuits Syst. Video Technol..
[40] Wei-Chung Hsu,et al. The performance of runtime data cache prefetching in a dynamic optimization system , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..
[41] Kenta Kasai,et al. Flexible non-binary LDPC decoding on FPGAs , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[42] Peng Zhang,et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[43] Collin McCurdy,et al. Diagnosis and optimization of application prefetching performance , 2013, ICS '13.
[44] Jason Cong,et al. An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers , 2016, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[45] Doris Chen,et al. Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms , 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).
[46] Torsten Hoefler,et al. Scientific Benchmarking of Parallel Computing Systems: Twelve ways to tell the masses when reporting performance results , 2017 .
[47] Jason Cong,et al. Understanding Performance Differences of FPGAs and GPUs: (Abstract Only) , 2018, FPGA.
[48] Jing Li,et al. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network , 2017, FPGA.
[49] Tomasz Kryjak,et al. Real-time hardware–software embedded vision system for ITS smart camera implemented in Zynq SoC , 2018, Journal of Real-Time Image Processing.
[50] Zhiru Zhang,et al. Multithreaded pipeline synthesis for data-parallel kernels , 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[51] David R. Kaeli,et al. Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs , 2014, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing.