Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures
Canqun Yang | Peng Zhang | Jianbin Fang | Tao Tang | Chun Huang | Zheng Wang
[1] Hervé Paulino,et al. Stream Processing on Hybrid CPU/Intel® Xeon Phi™ Systems , 2018, Euro-Par.
[2] Michael F. P. O'Boyle,et al. Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems , 2014, ACM Trans. Archit. Code Optim.
[3] Christoph W. Kessler,et al. Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-GPU systems , 2011, IWMSE '11.
[4] Michael F. P. O'Boyle,et al. Portable mapping of data parallel programs to OpenCL for heterogeneous systems , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[5] J. Xu. OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .
[6] Michael F. P. O'Boyle,et al. Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code , 2014, CC.
[7] Rudolf Eigenmann,et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.
[8] Michael F. P. O'Boyle,et al. Partitioning streaming parallelism for multi-cores: A machine learning based approach , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[9] Bingsheng He,et al. MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors , 2015, IEEE Transactions on Parallel and Distributed Systems.
[10] Hyesoon Kim,et al. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[11] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.
[12] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[13] Jeffrey S. Vetter,et al. A Survey of CPU-GPU Heterogeneous Computing Techniques , 2015, ACM Comput. Surv.
[14] Wolfgang Ertel,et al. On the Definition of Speedup , 1994, PARLE.
[15] H. Hotelling. Analysis of a complex of statistical variables into principal components. , 1933 .
[16] Jason Maassen,et al. Performance Models for CPU-GPU Data Transfers , 2014, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[17] Yansong Feng,et al. Proteus: network-aware web browsing on heterogeneous mobile systems , 2018, CoNEXT.
[18] Hao Wang,et al. Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Rudolf Eigenmann,et al. OpenMPC: Extended OpenMP Programming and Tuning for GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[20] Michael F. P. O'Boyle,et al. Using machine learning to partition streaming programs , 2013, ACM Trans. Archit. Code Optim.
[21] Alejandro Duran,et al. Heterogeneous Streaming , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[22] Bingsheng He,et al. Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors , 2018, J. Parallel Distributed Comput.
[23] Michael F. P. O'Boyle,et al. Smart, adaptive mapping of parallelism in the presence of external workload , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[24] Sabri Pllana,et al. HSTREAM: A Directive-Based Language Extension for Heterogeneous Stream Computing , 2018, 2018 IEEE International Conference on Computational Science and Engineering (CSE).
[25] Tao Tang,et al. Evaluating the Performance Impact of Multiple Streams on the MIC-Based Heterogeneous Platform , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[26] Barry Porter,et al. Improving Spark Application Throughput Via Memory Aware Task Co-location: A Mixture of Experts Approach , 2017 .
[27] Alexey L. Lastovetsky,et al. Model-Based Optimization of EULAG Kernel on Intel Xeon Phi Through Load Imbalancing , 2017, IEEE Transactions on Parallel and Distributed Systems.
[28] Tao Wang,et al. Bootstrapping Parameter Space Exploration for Fast Tuning , 2018, ICS.
[29] Jie Shen,et al. Workload Partitioning for Accelerating Applications on Heterogeneous Platforms , 2016, IEEE Transactions on Parallel and Distributed Systems.
[30] B. Manly. Multivariate Statistical Methods : A Primer , 1986 .
[31] Jaejin Lee,et al. An Auto-Tuner for OpenCL Work-Group Size on GPUs , 2018, IEEE Transactions on Parallel and Distributed Systems.
[32] Kenli Li,et al. UHCL-Darknet: An OpenCL-based Deep Neural Network Framework for Heterogeneous Multi-/Many-core Clusters , 2018, ICPP.
[33] Canqun Yang,et al. Evaluating Multiple Streams on Heterogeneous Platforms , 2016, Parallel Process. Lett.
[34] Luis Miguel Sánchez,et al. Automatic CPU/GPU Generation of Multi-versioned OpenCL Kernels for C++ Scientific Applications , 2017, International Journal of Parallel Programming.
[35] Zheng Wang,et al. Machine Learning in Compiler Optimization , 2018, Proceedings of the IEEE.
[36] Takahiro Katagiri,et al. Auto-Tuning on NUMA and Many-Core Environments with an FDM Code , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[37] Shoaib Kamil,et al. OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[38] Zheng Wang,et al. Adaptive optimization for OpenCL programs on embedded heterogeneous systems , 2017, LCTES.
[39] Lu Yuan,et al. Using Machine Learning to Optimize Web Interactions on Heterogeneous Mobile Systems , 2019, IEEE Access.
[40] Ling Gao,et al. Optimise web browsing on heterogeneous mobile platforms: A machine learning based approach , 2017, IEEE INFOCOM 2017 - IEEE Conference on Computer Communications.
[41] Eddy Z. Zhang,et al. KernelGen -- The Design and Implementation of a Next Generation Compiler Platform for Accelerating Numerical Models on GPUs , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.
[42] P. Sadayappan,et al. Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations , 2018, Proceedings of the IEEE.
[43] Satoshi Matsuoka,et al. Auto-tuning 3-D FFT library for CUDA GPUs , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[44] Torsten Hoefler,et al. Polly-ACC Transparent compilation to heterogeneous hardware , 2016, ICS.
[45] Timothy B. Costa,et al. Optimizing Matrix Multiplication on Intel® Xeon Phi™ x200 Architecture , 2017, 2017 IEEE 24th Symposium on Computer Arithmetic (ARITH).
[46] Chris Cummins,et al. End-to-End Deep Learning of Optimization Heuristics , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[47] Michael F. P. O'Boyle,et al. Integrating profile-driven parallelism detection and machine-learning-based mapping , 2014, TACO.
[48] Jianbin Fang,et al. A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.
[49] Yehia El-khatib,et al. Adaptive deep learning model selection on embedded systems , 2018, LCTES.
[50] David Cox,et al. Input-Aware Auto-Tuning of Compute-Bound HPC Kernels , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[51] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[52] Tao Tang,et al. Streaming Applications on Heterogeneous Platforms , 2016, NPC.
[53] Zheng Gong,et al. Software pipelining for graphic processing unit acceleration: Partition, scheduling and granularity , 2016, Int. J. High Perform. Comput. Appl.
[54] Bingsheng He,et al. Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach , 2015, Proc. VLDB Endow.
[55] Michael F. P. O'Boyle,et al. MILEPOST GCC: machine learning based research compiler , 2008 .
[56] Michael F. P. O'Boyle,et al. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping , 2009, PLDI '09.
[57] Gaël Varoquaux,et al. Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res.
[58] Xiaobing Feng,et al. Predicting Cross-Core Performance Interference on Multicore Processors with Regression Analysis , 2016, IEEE Transactions on Parallel and Distributed Systems.
[59] José Ignacio Benavides Benítez,et al. Performance models for asynchronous data transfers on consumer Graphics Processing Units , 2012, J. Parallel Distributed Comput.
[60] Tao Tang,et al. LU factorization on heterogeneous systems: an energy-efficient approach towards high performance , 2016, Computing.
[61] Peng Zhang,et al. Auto-tuning Streamed Applications on Intel Xeon Phi , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[62] John D. Owens,et al. GPU Computing , 2008, Proceedings of the IEEE.
[63] Erwin Kreyszig,et al. Advanced Engineering Mathematics 10th Edition , 2016 .
[64] Scott B. Baden,et al. Modeling and predicting performance of high performance computing applications on hardware accelerators , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[65] Yun Liang,et al. Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[66] Michael F. P. O'Boyle,et al. Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms , 2014, 2014 21st International Conference on High Performance Computing (HiPC).
[67] Jianbin Fang,et al. Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture , 2019, International Journal of Parallel Programming.