Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems
Min Xie | Chen Juan | Hao Wang | Jianbin Fang | Yuan Yuan | Tao Tang | Chun Huang | Zheng Wang | Wenxu Zheng | Feihao Wu | Xiaodong Pan | Xiaole Sun
[1] Edgar Gabriel,et al. A Tool for Optimizing Runtime Parameters of Open MPI , 2008, PVM/MPI.
[2] Chris Cummins,et al. End-to-End Deep Learning of Optimization Heuristics , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[3] Michael F. P. O'Boyle,et al. Portable mapping of data parallel programs to OpenCL for heterogeneous systems , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[4] Michael F. P. O'Boyle,et al. Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code , 2014, CC.
[5] Shweta Jha,et al. Impact and Limitations of Point-to-Point Performance on Collective Algorithms , 2016, 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid).
[6] Michael F. P. O'Boyle,et al. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping , 2009, PLDI '09.
[7] Jianbin Fang,et al. Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture , 2019, International Journal of Parallel Programming.
[8] Zheng Wang,et al. Fast Automatic Heuristic Construction Using Active Learning , 2014, LCPC.
[9] Xuejun Yang,et al. Tianhe-1A Interconnect and Message-Passing Services , 2012, IEEE Micro.
[10] Michael F. P. O'Boyle,et al. Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems , 2014, ACM Trans. Archit. Code Optim.
[11] Kevin Harms,et al. Characterization of MPI Usage on a Production Supercomputer , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Radu Prodan,et al. Tuning MPI Runtime Parameter Setting for High Performance Computing , 2012, 2012 IEEE International Conference on Cluster Computing Workshops.
[13] Michael F. P. O'Boyle,et al. Using machine learning to partition streaming programs , 2013, ACM Trans. Archit. Code Optim.
[14] Alexey L. Lastovetsky,et al. Hierarchical redesign of classic MPI reduction algorithms , 2016, The Journal of Supercomputing.
[15] Yi Zheng,et al. The TH Express high performance interconnect networks , 2014, Frontiers of Computer Science.
[16] Michael F. P. O'Boyle,et al. Smart, adaptive mapping of parallelism in the presence of external workload , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[17] Jack J. Dongarra,et al. Performance analysis of MPI collective operations , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[18] Xin Yuan,et al. Automatic generation and tuning of MPI collective communication routines , 2005, ICS '05.
[19] Jie Wang,et al. Optimizing MPI Runtime Parameter Settings by Using Machine Learning , 2009, PVM/MPI.
[20] Xiaofang Zhao,et al. Multi-core aware optimization for MPI collectives , 2008, 2008 IEEE International Conference on Cluster Computing.
[21] Alexey L. Lastovetsky,et al. Topology-oblivious optimization of MPI broadcast algorithms on extreme-scale platforms , 2015, Simul. Model. Pract. Theory.
[22] Rajeev Thakur,et al. Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl.
[23] Pavlos Petoumenos,et al. Minimizing the cost of iterative compilation with active learning , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[24] Zizhong Chen,et al. Runtime Optimization of Broadcast Communications Using Dynamic Network Topology Information from MPI , 2012, 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems.
[25] Barry Porter,et al. Improving Spark Application Throughput Via Memory Aware Task Co-location: A Mixture of Experts Approach , 2017.
[26] 廖湘科,et al. High Performance Interconnect Network for Tianhe System , 2015.
[27] Eli Upfal,et al. Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems , 1997, IEEE Trans. Parallel Distributed Syst.
[28] Torsten Hoefler,et al. Cache-Oblivious MPI All-to-All Communications Based on Morton Order , 2018, IEEE Transactions on Parallel and Distributed Systems.
[29] Peng Zhang,et al. Auto-tuning Streamed Applications on Intel Xeon Phi , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[30] Xin Yuan,et al. Message scheduling for all-to-all personalized communication on ethernet switched clusters , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[31] Zheng Wang,et al. Machine Learning in Compiler Optimization , 2018, Proceedings of the IEEE.
[32] Zheng Wang,et al. Adaptive optimization for OpenCL programs on embedded heterogeneous systems , 2017, LCTES.
[33] Michael F. P. O'Boyle,et al. Mapping parallelism to multi-cores: a machine learning based approach , 2009, PPoPP '09.
[34] Xin Yuan,et al. CC-MPI: a compiled communication capable MPI prototype for ethernet switched clusters , 2003, PPoPP '03.
[35] Yansong Feng,et al. Proteus: network-aware web browsing on heterogeneous mobile systems , 2018, CoNEXT.
[36] Michael F. P. O'Boyle,et al. OpenCL Task Partitioning in the Presence of GPU Contention , 2013, LCPC.
[37] Michael F. P. O'Boyle,et al. A workload-aware mapping approach for data-parallel programs , 2011, HiPEAC.
[38] Ling Gao,et al. Optimise web browsing on heterogeneous mobile platforms: A machine learning based approach , 2017, IEEE INFOCOM 2017 - IEEE Conference on Computer Communications.
[39] Michael F. P. O'Boyle,et al. Partitioning streaming parallelism for multi-cores: A machine learning based approach , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[40] Michael F. P. O'Boyle,et al. Integrating profile-driven parallelism detection and machine-learning-based mapping , 2014, TACO.
[41] Laxmikant V. Kalé,et al. A framework for collective personalized communication , 2003, Proceedings International Parallel and Distributed Processing Symposium.
[42] Yehia El-khatib,et al. Adaptive deep learning model selection on embedded systems , 2018, LCTES.