Paper information - Benchmarking, Measuring, and Optimizing: Second BenchCouncil International Symposium, Bench 2019, Denver, CO, USA, November 14–16, 2019, Revised Selected Papers


In this talk, we will cover the widening gaps between headline performance and application performance on Frontera and the last several generations of TACC supercomputers. We will also discuss the challenges of developing a new benchmark suite for the upcoming Leadership-Class Computing Facility, and solicit community input on capability benchmarks.

Bio: Dr. Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin since 2018 and Executive Director of the Texas Advanced Computing Center (TACC) since 2014, is a nationally recognized leader in high performance computing. He is the principal investigator (PI) for a National Science Foundation (NSF) grant to deploy Frontera, which is the fastest supercomputer at any U.S. university. Stanzione is also the PI of TACC’s Stampede2 and Wrangler systems, supercomputers for high performance computing and for data-focused applications, respectively. For six years he was co-PI of CyVerse, a large-scale NSF life sciences cyberinfrastructure. Stanzione was also a co-PI for TACC’s Ranger and Lonestar supercomputers, large-scale NSF systems previously deployed at UT Austin. Stanzione received his bachelor’s degree in electrical engineering and his master’s degree and doctorate in computer engineering from Clemson University.

Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale Systems

Dhabaleswar K. (DK) Panda, The Ohio State University

Abstract: This talk will focus on challenges in designing benchmarks and middleware for convergent HPC, Deep Learning, and Big Data Analytics software stacks for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the OSU Micro-Benchmarks (OMB) suite and associated middleware for designing runtime environments for MPI+X programming models, taking into account support for multi-core systems (x86, OpenPOWER, and ARM), high-performance networks, and GPGPUs (including GPUDirect RDMA). Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For Big Data Analytics, an overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC, and HBase), Spark, and Memcached, together with the OSU HiBD benchmarks (http://hibd.cse.ohio-state.edu), will be presented. For the Deep Learning domain, we will focus on a set of benchmarks and profiling tools to deliver scalable DNN training with Horovod and TensorFlow using the MVAPICH2-GDR MPI library (http://hidl.cse.ohio-state.edu).

Bio: Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP, and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,025 organizations worldwide (in 89 countries). More than 600,000 downloads of this software have taken place from the project’s site. This software is empowering several InfiniBand clusters (including the 3rd, 5th, 8th, 15th, 16th, 19th, and 31st ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop, and Memcached, together with the OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu), are also publicly available. These libraries are currently being used by more than 315 organizations in 35 countries. More than 31,300 downloads of these libraries have taken place. High-performance and scalable versions of the Caffe and TensorFlow frameworks are available from https://hidl.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

[1] Wanling Gao,et al. Data motifs: a lens towards fully understanding big data and AI workloads , 2018, PACT.

[2] Tao Wang,et al. Deep learning with COTS HPC systems , 2013, ICML.

[3] Katerina J. Argyraki,et al. How to Measure the Killer Microsecond , 2017, CCRV.

[4] Ameet Talwalkar,et al. MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[5] Alexandru Iosup,et al. An Empirical Performance Evaluation of GPU-Enabled Graph-Processing Systems , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[6] Rajeev Dehejia,et al. Propensity Score-Matching Methods for Nonexperimental Causal Studies , 2002, Review of Economics and Statistics.

[7] Osamu Watanabe,et al. Developing Efficient Implementations of Bellman-Ford and Forward-Backward Graph Algorithms for NEC SX-ACE , 2018, Supercomput. Front. Innov..

[8] Ruocheng Guo,et al. Learning Individual Treatment Effects from Networked Observational Data , 2019, IJCAI.

[9] Endong Wang,et al. Intel Math Kernel Library , 2014 .

[10] Lloyd N. Trefethen,et al. Fourth-Order Time-Stepping for Stiff PDEs , 2005, SIAM J. Sci. Comput..

[11] Guangli Li,et al. XDN: Towards Efficient Inference of Residual Neural Networks on Cambricon Chips , 2019, Bench.

[12] Gerhard Wellein,et al. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments , 2010, 2010 39th International Conference on Parallel Processing Workshops.

[13] Randy H. Katz,et al. Heterogeneity and dynamicity of clouds at scale: Google trace analysis , 2012, SoCC '12.

[14] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Holger Karl,et al. DCT2Gen: A traffic generator for data centers , 2016, Comput. Commun..

[16] Andreas Hellander,et al. HarmonicIO: Scalable Data Stream Processing for Scientific Datasets , 2018, 2018 IEEE 11th International Conference on Cloud Computing (CLOUD).

[17] Michael A. Bender,et al. File Systems Fated for Senescence? Nonsense, Says Science! , 2017, FAST.

[18] Amin Vahdat,et al. Carousel: Scalable Traffic Shaping at End Hosts , 2017, SIGCOMM.

[19] Fan Zhang,et al. AIoT Bench: Towards Comprehensive Benchmarking Mobile and Embedded Device Intelligence , 2018, Bench.

[20] Eero Vainikko,et al. Petascale solvers for anisotropic PDEs in atmospheric modelling on GPU clusters , 2015, Parallel Comput..

[21] Kejiang Ye,et al. Imbalance in the cloud: An analysis on Alibaba cluster trace , 2017, 2017 IEEE International Conference on Big Data (Big Data).

[22] Arne-Jørgen Berre,et al. Evidence Based Big Data Benchmarking to Improve Business Performance , 2018 .

[23] Srihari Cadambi,et al. A dynamically configurable coprocessor for convolutional neural networks , 2010, ISCA.

[24] Samuel Williams,et al. An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[25] Winfried Auzinger,et al. Practical splitting methods for the adaptive integration of nonlinear evolution equations. Part II: Comparisons of local error estimation and step-selection strategies for nonlinear Schrödinger and wave equations , 2019, Comput. Phys. Commun..

[26] Lovisa Lugnegård. Building a high throughput microscope simulator using the Apache Kafka streaming framework , 2018 .

[27] Adrian Schüpbach,et al. The multikernel: a new OS architecture for scalable multicore systems , 2009, SOSP '09.

[28] Yiying Tong,et al. FaceWarehouse: A 3D Facial Expression Database for Visual Computing , 2014, IEEE Transactions on Visualization and Computer Graphics.

[29] Yuchen Zhang,et al. HPC AI500: A Benchmark Suite for HPC AI Systems , 2018, Bench.

[30] Greg Linden,et al. Amazon . com Recommendations Item-to-Item Collaborative Filtering , 2001 .

[31] George Karypis,et al. Item-based top-N recommendation algorithms , 2004, TOIS.

[32] Benson K. Muite,et al. A comparison of CPU and GPU performance for Fourier pseudospectral simulations of the Navier-Stokes, Cubic Nonlinear Schrodinger and Sine Gordon Equations , 2012 .

[33] R. Wollman,et al. High throughput microscopy: from raw images to discoveries , 2007, Journal of Cell Science.

[34] Dusan Markovic,et al. Benchmarking performance and energy efficiency of microprocessors for wireless sensor network applications , 2012, 2012 Proceedings of the 35th International Convention MIPRO.

[35] Archana Ganapathi,et al. The Case for Evaluating MapReduce Performance Using Workload Suites , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[36] Hans De Sterck,et al. Algorithmic Acceleration of Parallel ALS for Collaborative Filtering: Speeding up Distributed Big Data Recommendation in Spark , 2015, 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS).

[37] Abhinandan Das,et al. Google news personalization: scalable online collaborative filtering , 2007, WWW '07.

[38] Chao Li,et al. Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale , 2014, Proc. VLDB Endow..

[39] Kilian Q. Weinberger,et al. Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[41] Steven G. Johnson,et al. FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[43] Joseph Gonzalez,et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[44] Tilmann Rabl,et al. Big Data Benchmark Compendium , 2015, TPCTC.

[45] Vadim D. Levchenko,et al. Performance Limits Study of Stencil Codes on Modern GPGPUs , 2019, Supercomput. Front. Innov..

[46] Stephen Bonner,et al. Causal embeddings for recommendation , 2017, RecSys.

[47] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[48] David Flynn,et al. DFS: A file system for virtualized flash storage , 2010, TOS.

[49] Erich Elsen,et al. Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[50] Wenguang Chen,et al. Gemini: A Computation-Centric Distributed Graph Processing System , 2016, OSDI.

[51] Jack J. Dongarra,et al. The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..

[52] Liu Bingbing. CloudBM:a Benchmark for Cloud Data Management Systems , 2012 .

[53] Chen Yang,et al. AstroServ: Distributed Database for Serving Large-Scale Full Life-Cycle Astronomical Data , 2018, BigSDM.

[54] Luca Benini,et al. GAP-8: A RISC-V SoC for AI at the Edge of the IoT , 2018, 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[55] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[56] Franck Cappello,et al. Failure prediction for HPC systems and applications , 2013, Int. J. High Perform. Comput. Appl..

[57] Ramesh Radhakrishnan,et al. Demystifying the MLPerf Benchmark Suite , 2019, ArXiv.

[58] Yann LeCun,et al. Pedestrian Detection with Unsupervised Multi-stage Feature Learning , 2012, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[59] Mihaela van der Schaar,et al. GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets , 2018, ICLR.

[60] Christopher Torng,et al. The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips , 2018, IEEE Micro.

[61] Dennis M. Wilkinson,et al. Large-Scale Parallel Collaborative Filtering for the Netflix Prize , 2008, AAIM.

[62] Chuan Wu,et al. Deep Learning-based Job Placement in Distributed Machine Learning Clusters , 2019, IEEE INFOCOM 2019 - IEEE Conference on Computer Communications.

[63] Minghe Yu,et al. AIBench: An Industry Standard Internet Service AI Benchmark Suite , 2019, ArXiv.

[64] Michael J. Franklin,et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[65] Ian T. Foster,et al. Secure, Efficient Data Transport and Replica Management for High-Performance Data-Intensive Computing , 2001, 2001 Eighteenth IEEE Symposium on Mass Storage Systems and Technologies.

[66] Maosen Chen,et al. An Efficient Implementation of the ALS-WR Algorithm on x86 CPUs , 2019, Bench.

[67] Ruocheng Guo,et al. Causal Learning in Question Quality Improvement , 2019, Bench.

[68] Bernhard Schölkopf,et al. Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks , 2014, J. Mach. Learn. Res..

[69] Frederico Pratas,et al. Cache-aware Roofline model: Upgrading the loft , 2014, IEEE Computer Architecture Letters.

[70] Nathan R. Tallent,et al. HPCTOOLKIT: tools for performance analysis of optimized parallel programs , 2010, Concurr. Comput. Pract. Exp..

[71] Aaron Halfaker,et al. Identifying Semantic Edit Intentions from Revisions in Wikipedia , 2017, EMNLP.

[72] B. Scheers,et al. Column Store for GWAC: A High-cadence, High-density, Large-scale Astronomical Light Curve Pipeline and Distributed Shared-nothing Database , 2016 .

[73] Jeffrey A. Smith,et al. Does Matching Overcome Lalonde's Critique of Nonexperimental Estimators? , 2000 .

[74] Léon Bottou,et al. The Tradeoffs of Large Scale Learning , 2007, NIPS.

[75] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[76] Fan Zhang,et al. AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking , 2018, Bench.

[77] Li Zhang,et al. GPU-accelerated Large-Scale Non-negative Matrix Factorization Using Spark , 2018, CollaborateCom.

[78] Wanling Gao,et al. DCMIX: Generating Mixed Workloads for the Cloud Data Center , 2018, Bench.

[79] Rajeev Balasubramonian,et al. Managing DRAM Latency Divergence in Irregular GPGPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[80] Tianshu Hao,et al. The Implementation and Optimization of Matrix Decomposition Based Collaborative Filtering Task on X86 Platform , 2019, Bench.

[81] Xiao Wang,et al. AutoFFT: a template-based FFT codes auto-generation framework for ARM and X86 CPUs , 2019, SC.

[82] Shiguang Shan,et al. Improving 2D Face Recognition via Discriminative Face Depth Estimation , 2018, 2018 International Conference on Biometrics (ICB).

[83] Jie Huang,et al. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[84] Thorsten Kurth,et al. Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system , 2020, Concurr. Comput. Pract. Exp..

[85] Hari Sundar,et al. FFT, FMM, or Multigrid? A comparative Study of State-Of-the-Art Poisson Solvers for Uniform and Nonuniform Grids in the Unit Cube , 2014, SIAM J. Sci. Comput..

[86] Herodotos Herodotou,et al. MapReduce programming and cost-based optimization? , 2011, Proc. VLDB Endow..

[87] Jennifer L. Hill,et al. Bayesian Nonparametric Modeling for Causal Inference , 2011 .

[88] F. Krogh,et al. Solving Ordinary Differential Equations , 2019, Programming for Computations - Python.

[89] Jay Kreps,et al. Kafka : a Distributed Messaging System for Log Processing , 2011 .

[90] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[91] Zhengming Ding,et al. Latent Tensor Transfer Learning for RGB-D Action Recognition , 2014, ACM Multimedia.

[92] Johan Karlsson,et al. Adapting the Secretary Hiring Problem for Optimal Hot-Cold Tier Placement Under Top-K Workloads , 2019, 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[93] G. Duncan,et al. Economic deprivation and early childhood development. , 1994, Child development.

[94] Ruocheng Guo,et al. A Practical Data Repository for Causal Learning with Big Data , 2019, Bench.

[95] Rishabh Mehrotra,et al. The Music Streaming Sessions Dataset , 2018, WWW.

[96] Ruocheng Guo,et al. Diffusion in Social Networks , 2015, SpringerBriefs in Computer Science.

[97] Michael Stonebraker,et al. A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[98] Nicolas Gillis,et al. Accelerated Multiplicative Updates and Hierarchical ALS Algorithms for Nonnegative Matrix Factorization , 2011, Neural Computation.

[99] Tom Schaul,et al. Dueling Network Architectures for Deep Reinforcement Learning , 2015, ICML.

[100] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[101] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[102] Thorsten Joachims,et al. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback , 2015, ICML.

[103] Babak Falsafi,et al. Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[104] James Philbin,et al. FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[105] D. Rubin. [On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.] Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies , 1990 .

[106] Max Welling,et al. Causal Effect Inference with Deep Latent-Variable Models , 2017, NIPS 2017.

[107] Xu Wen,et al. Improving RGB-D Face Recognition via Transfer Learning from a Pretrained 2D Network , 2019, Bench.

[108] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[109] Ruocheng Guo,et al. Robust Cyberbullying Detection with Causal Interpretation , 2019, WWW.

[110] Chao Yang,et al. 10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[111] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[112] Gu-Yeon Wei,et al. Fathom: reference workloads for modern deep learning methods , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[113] Wei Cao,et al. DT-CGRA: Dual-track coarse-grained reconfigurable architecture for stream applications , 2016, 2016 26th International Conference on Field Programmable Logic and Applications (FPL).

[114] Ninghui Sun,et al. DianNao family , 2016, Commun. ACM.

[115] Yoshua Bengio,et al. Practical Recommendations for Gradient-Based Training of Deep Architectures , 2012, Neural Networks: Tricks of the Trade.

[116] Rafal Zdunek,et al. Distributed Nonnegative Matrix Factorization with HALS Algorithm on Apache Spark , 2018, ICAISC.

[117] John Langford,et al. The offset tree for learning with partial labels , 2008, KDD.

[118] Steve B. Jiang,et al. Intelligent Parameter Tuning in Optimization-Based Iterative CT Reconstruction via Deep Reinforcement Learning , 2017, IEEE Transactions on Medical Imaging.

[119] Tianshi Chen,et al. Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[120] Samuel Williams,et al. Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis , 2018, 2018 International Conference on High Performance Computing & Simulation (HPCS).

[121] Houman Homayoun,et al. Hadoop Workloads Characterization for Performance and Energy Efficiency Optimizations on Microservers , 2018, IEEE Transactions on Multi-Scale Computing Systems.

[122] Ami Marowka,et al. On Performance Analysis of a Multithreaded Application Parallelized by Different Programming Models Using Intel VTune , 2011, PaCT.

[123] Reinhold Weicker,et al. Dhrystone: a synthetic systems programming benchmark , 1984, CACM.

[124] Ryen W. White,et al. Clarifications and question specificity in synchronous social Q&A , 2013, CHI Extended Abstracts.

[125] Jure Leskovec,et al. Inferring Networks of Substitutable and Complementary Products , 2015, KDD.

[126] D. Almond,et al. The Costs of Low Birth Weight , 2004 .

[127] Tom Schaul,et al. Prioritized Experience Replay , 2015, ICLR.

[128] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[129] Lei Li,et al. CGMH: Constrained Sentence Generation by Metropolis-Hastings Sampling , 2018, AAAI.

[130] Minyi Guo,et al. PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms , 2019, Bench.

[131] Zhibin Yu,et al. The Elasticity and Plasticity in Semi-Containerized Co-locating Cloud Workload: a View from Alibaba Trace , 2018, SoCC.

[132] Chih-Jen Lin,et al. A Practical Guide to Support Vector Classication , 2008 .

[133] Ran El-Yaniv,et al. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations , 2016, J. Mach. Learn. Res..

[134] Ernst Hairer,et al. Simulating Hamiltonian dynamics , 2006, Math. Comput..

[135] Zihan Jiang,et al. Performance Analysis of Cambricon MLU100 , 2019, Bench.

[136] K.W. Bowyer,et al. Using a Multi-Instance Enrollment Representation to Improve 3D Face Recognition , 2007, 2007 First IEEE International Conference on Biometrics: Theory, Applications, and Systems.

[137] Fernando Ortega,et al. A non negative matrix factorization for collaborative filtering recommender systems based on a Bayesian probabilistic model , 2016, Knowl. Based Syst..

[138] Jim Webber,et al. A programmatic introduction to Neo4j , 2018, SPLASH '12.

[139] Li Fu,et al. Improve Image Classification by Convolutional Network on Cambricon , 2019, Bench.

[140] Zheng Zhang,et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.

[141] Ola Spjuth,et al. SNIC Science Cloud (SSC): A National-Scale Cloud Infrastructure for Swedish Academia , 2017, 2017 IEEE 13th International Conference on e-Science (e-Science).

[142] Kunle Olukotun,et al. DAWNBench : An End-to-End Deep Learning Benchmark and Competition , 2017 .

[143] Junwei Han,et al. CNNs-Based RGB-D Saliency Detection via Cross-View Transfer and Multiview Fusion. , 2018, IEEE transactions on cybernetics.

[144] Nikolai Joukov,et al. A nine year study of file system and storage benchmarking , 2008, TOS.

[145] Torsten Hoefler,et al. Using automated performance modeling to find scalability bugs in complex codes , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[146] Karline Soetaert,et al. Solving Ordinary Differential Equations in R , 2012 .

[147] Ching-Yung Lin,et al. GraphBIG: understanding graph computing in the context of industrial solutions , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[148] Trevor Darrell,et al. Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[149] Luca Maria Gambardella,et al. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Flexible, High Performance Convolutional Neural Networks for Image Classification , 2022 .

[150] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .

[151] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[152] Mikko H. Lipasti,et al. BenchNN: On the broad potential application scope of hardware neural network accelerators , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[153] Matteo Parsani,et al. More efficient time integration for Fourier pseudospectral DNS of incompressible turbulence , 2018, International Journal for Numerical Methods in Fluids.

[154] Lei Zou,et al. gStore: Answering SPARQL Queries via Subgraph Matching , 2011, Proc. VLDB Endow..

[155] Andrea C. Arpaci-Dusseau,et al. Generating realistic impressions for file-system benchmarking , 2009, TOS.

[156] Ian T. Foster,et al. Jetstream: a self-provisioned, scalable science and engineering cloud environment , 2015, XSEDE.

[157] Andrew S. Cassidy,et al. A million spiking-neuron integrated circuit with a scalable communication network and interface , 2014, Science.

[159] Eunyoung Jeong,et al. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems , 2014, NSDI.

[160] Rainer Gemulla,et al. Distributed Matrix Completion , 2012, 2012 IEEE 12th International Conference on Data Mining.

[161] Marwan Mattar,et al. Labeled Faces in the Wild: A Database forStudying Face Recognition in Unconstrained Environments , 2008 .

[162] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[163] Nan Jiang,et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning , 2015, ICML.

[164] Ning Li,et al. Solving the Klein-Gordon equation using fourier spectral methods: a benchmark test for computer performance , 2015, SpringSim.

[165] Michael J. Freedman,et al. SLAQ: quality-driven scheduling for distributed machine learning , 2017, SoCC.

[166] Yee Whye Teh,et al. Causal Inference via Kernel Deviance Measures , 2018, NeurIPS.

[167] R. Lalonde. Evaluating the Econometric Evaluations of Training Programs with Experimental Data , 1984 .

[168] Brandon Lucia,et al. Combining Data Duplication and Graph Reordering to Accelerate Parallel Graph Processing , 2019, HPDC.

[169] Anil K. Jain,et al. Face recognition: Some challenges in forensics , 2011, Face and Gesture 2011.

[170] Fan Xia,et al. BSMA: A Benchmark for Analytical Queries over Social Media Data , 2014, Proc. VLDB Endow..

[171] Alexandros G. Dimakis,et al. Cost-Optimal Learning of Causal Graphs , 2017, ICML.

[172] Yanjun Wu,et al. RVTensor: A Light-Weight Neural Network Inference Framework Based on the RISC-V Architecture , 2019, Bench.

[173] Inderjit S. Dhillon,et al. Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems , 2012, 2012 IEEE 12th International Conference on Data Mining.

[174] David R. Kaeli,et al. DNNMark: A Deep Neural Network Benchmark Suite for GPUs , 2017, GPGPU@PPoPP.

[175] Bengt Fornberg,et al. A practical guide to pseudospectral methods: Introduction , 1996 .

[176] Dong Han,et al. Cambricon: An Instruction Set Architecture for Neural Networks , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[177] Yan Li,et al. CAPES: Unsupervised Storage Performance Tuning Using Neural Network-Based Deep Reinforcement Learning , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[178] Reynold Xin,et al. GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[179] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[180] Ruocheng Guo,et al. Linked Causal Variational Autoencoder for Inferring Paired Spillover Effects , 2018, CIKM.

[181] Alexandros G. Dimakis,et al. Learning Causal Graphs with Small Interventions , 2015, NIPS.

[182] Yoshua Bengio,et al. Exploring Strategies for Training Deep Neural Networks , 2009, J. Mach. Learn. Res..

[183] Mahadev Satyanarayanan,et al. OpenFace: A general-purpose face recognition library with mobile applications , 2016 .

[184] Stefanos Zafeiriou,et al. Statistical non-rigid ICP algorithm and its application to 3D face alignment , 2017, Image Vis. Comput..

[185] Uri Shalit,et al. Estimating individual treatment effect: generalization bounds and algorithms , 2016, ICML.

[186] Tianshi Chen,et al. ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[187] Zhuo Liu,et al. Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[188] Gregory R. Ganger,et al. Geriatrix: Aging what you see and what you don't see. A file system aging approach for modern storage systems , 2018, USENIX Annual Technical Conference.

[189] Nancy Wilkins-Diehr,et al. XSEDE: Accelerating Scientific Discovery , 2014, Computing in Science & Engineering.

[190] H. Chipman,et al. BART: Bayesian Additive Regression Trees , 2008, 0806.3286.

[191] Walter Karlen,et al. Perfect Match: A Simple Method for Learning Representations For Counterfactual Inference With Neural Networks , 2018, ArXiv.

[192] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[193] Ross B. Girshick,et al. Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[194] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[195] Dhanya Sridhar,et al. Using Text Embeddings for Causal Inference , 2019, ArXiv.

[196] Binyu Zang,et al. PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs , 2019, TOPC.

[197] Weitong Chen,et al. Enhancing recommendation on extremely sparse data with blocks-coupled non-negative matrix factorization , 2018, Neurocomputing.

[198] Ashish Sureka,et al. Chaff from the wheat: characterization and modeling of deleted questions on stack overflow , 2014, WWW.

[199] YU WANG,et al. A Survey of FPGA-Based Neural Network Inference Accelerator , 2019 .

[200] Ruocheng Guo,et al. Adaptive Unsupervised Feature Selection on Attributed Networks , 2019, KDD.

[201] Constantin F. Aliferis,et al. The max-min hill-climbing Bayesian network structure learning algorithm , 2006, Machine Learning.

[202] RalfHiutmut Gtiting,et al. GraphDB : Modeling and Querying Graphs in Databases , 1998 .

[203] Thorsten Joachims,et al. The Self-Normalized Estimator for Counterfactual Learning , 2015, NIPS.

[204] Timothy G. Armstrong,et al. LinkBench: a database benchmark based on the Facebook social graph , 2013, SIGMOD '13.

[205] Gary Bradski,et al. Computer Vision Face Tracking For Use in a Perceptual User Interface , 1998 .

[206] Carlo Curino,et al. Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[207] Kai Hwang,et al. Edge AIBench: Towards Comprehensive End-to-end Edge Computing Benchmarking , 2018, Bench.

[208] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[209] Stijn Eyerman,et al. Many-Core Graph Workload Analysis , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[210] Daniel Raumer,et al. MoonGen: A Scriptable High-Speed Packet Generator , 2014, Internet Measurement Conference.

[211] S. McIntosh-Smith,et al. Scaling Results From the First Generation of Arm-based Supercomputers , 2019 .

[212] L. Sirovich,et al. Low-dimensional procedure for the characterization of human faces. , 1987, Journal of the Optical Society of America. A, Optics and image science.

[213] Andreas Hellander,et al. BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data , 2018, BMC Bioinform..

[214] John Shalf,et al. HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems , 2014 .

[215] Donald B. Rubin,et al. Bayesian Inference for Causal Effects: The Role of Randomization , 1978 .

[216] Jiming Liu,et al. Social Collaborative Filtering by Trust , 2013, Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.

[217] Tommi S. Jaakkola,et al. Sequence to Better Sequence: Continuous Revision of Combinatorial Structures , 2017, ICML.

[218] Matthew G. Knepley,et al. A performance spectrum for parallel computational frameworks that solve PDEs , 2017, Concurr. Comput. Pract. Exp..

[219] Peter Dayan,et al. Q-learning , 1992, Machine Learning.

[220] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[221] Lizy Kurian John,et al. Benchmarking Big Data Systems: A Review , 2018, IEEE Transactions on Services Computing.

[222] Stacy Patterson,et al. EdgeBench: Benchmarking Edge Computing Platforms , 2018, 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion).

[223] Jian Sun,et al. Identity Mappings in Deep Residual Networks , 2016, ECCV.

[224] Dieter Kranzlmüller,et al. glogin - a multifunctional, interactive tunnel into the grid , 2004, Fifth IEEE/ACM International Workshop on Grid Computing.

[225] Jia Wang,et al. DaDianNao: A Machine-Learning Supercomputer , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[226] Jaewon Lee,et al. WSMeter: A Performance Evaluation Methodology for Google's Production Warehouse-Scale Computers , 2018, ASPLOS.

[227] Tao Tang,et al. Efficient and Portable ALS Matrix Factorization for Recommender Systems , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[228] Sunita Chandrasekaran,et al. NAS Parallel Benchmarks for GPGPUs Using a Directive-Based Programming Model , 2014, LCPC.

[229] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018, OSDI.

[230] Daisuke Takahashi,et al. Reproducibility in Benchmarking Parallel Fast Fourier Transform based Applications , 2019, ICPE Companion.

[231] Bronis R. de Supinski,et al. The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[232] Jian Sun,et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[233] Yang Chen,et al. Data Management Challenges and Real-Time Processing Technologies in Astronomy , 2017 .

[234] Jack J. Dongarra,et al. Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy , 2008, TOMS.

[235] F. Maxwell Harper,et al. The MovieLens Datasets: History and Context , 2016, TIIS.

[236] Chunjie Luo,et al. BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking , 2013, WBDB.

[237] Allen D. Malony,et al. The Tau Parallel Performance System , 2006, Int. J. High Perform. Comput. Appl..

[238] Isabelle Guyon,et al. Design and Analysis of the Causation and Prediction Challenge , 2008, WCCI Causation and Prediction Challenge.

[239] Hassan Chafi,et al. The LDBC Social Network Benchmark: Interactive Workload , 2015, SIGMOD Conference.

[240] M. Turk,et al. Eigenfaces for Recognition , 1991, Journal of Cognitive Neuroscience.

[241] Adam Silberstein,et al. Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[242] K. Sachs,et al. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data , 2005, Science.

[243] Raouf Boutaba,et al. Characterizing Task Usage Shapes in Google Compute Clusters , 2011 .

[244] Razvan Pascanu,et al. Theano: A CPU and GPU Math Compiler in Python , 2010, SciPy.

[245] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[246] Debajyoti Mukhopadhyay,et al. Matrix Factorization Model in Collaborative Filtering Algorithms: A Survey , 2015 .

[247] Jie Huang,et al. Benchmarking modern distributed streaming platforms , 2016, 2016 IEEE International Conference on Industrial Technology (ICIT).

[248] S. Giordano,et al. BRUNO: A high performance traffic generator for network processor , 2008, 2008 International Symposium on Performance Evaluation of Computer and Telecommunication Systems.

[249] Ali Anwar,et al. Characterizing Co-located Datacenter Workloads: An Alibaba Case Study , 2018, APSys.

[250] Reynold Xin,et al. Apache Spark , 2016 .

[251] K. Dosaka,et al. A 40GOPS 250mW massively parallel processor based on matrix architecture , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[252] Jure Leskovec,et al. Discovering value from community activity on focused question answering sites: a case study of stack overflow , 2012, KDD.

[253] Hoi-Jun Yoo,et al. A 201.4 GOPS 496 mW Real-Time Multi-Object Recognition Processor With Bio-Inspired Neural Perception Engine , 2009, IEEE Journal of Solid-State Circuits.

[254] David H. Bailey,et al. The NAS Parallel Benchmarks 2.0 , 2015 .

[255] David E. Keyes,et al. Efficiency of High Order Spectral Element Methods on Petascale Architectures , 2016, ISC.

[256] Omer Khan,et al. CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores , 2015, 2015 IEEE International Symposium on Workload Characterization.

[257] Jack Dongarra,et al. A new metric for ranking high-performance computing systems , 2016, National Science Review.

[258] David A. Patterson,et al. A new golden age for computer architecture , 2019, Commun. ACM.

[259] Seif Haridi,et al. Apache Flink™: Stream and Batch Processing in a Single Engine , 2015, IEEE Data Eng. Bull..

[260] Raj Jain,et al. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.

[261] Philippe Owezarski,et al. OSNT: open source network tester , 2014, IEEE Network.

[262] Philippe Couvee,et al. Recurrent Neural Network for Classifying of HPC Applications , 2019, 2019 Spring Simulation Conference (SpringSim).

[263] Feiyi Wang,et al. Diving into petascale production file systems through large scale profiling and analysis , 2017, PDSW-DISCS@SC.

[264] Intel® Guide for Developing Multithreaded Applications Part 1: Application Threading and Synchronization Summary , 2010 .

[265] Krisztian Balog,et al. Identifying Unclear Questions in Community Question Answering Websites , 2019, ECIR.

[266] Joshua M. Stuart,et al. The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[267] Nectarios Koziris,et al. SparseX: A Library for High-Performance Sparse Matrix-Vector Multiplication on Multicore Platforms , 2018, ACM Trans. Math. Softw..

[268] Matteo Parsani,et al. Fully Implicit Time Stepping Can Be Efficient on Parallel Computers , 2019, Supercomput. Front. Innov..

[269] Chiara Francalanci,et al. Relating Big Data Business and Technical Performance Indicators , 2018 .

[270] Zheng Wang,et al. Adaptive Optimization of Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures , 2018, 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[271] Y. Abdulkadir. Comparison of Finite Difference Schemes for the Wave Equation Based on Dispersion , 2015 .

[272] Manaal Faruqui,et al. Identifying Well-formed Natural Language Questions , 2018, EMNLP.

[273] María S. Pérez-Hernández,et al. Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks , 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[274] Nicolas Gillis,et al. Accelerating Nonnegative Matrix Factorization Algorithms Using Extrapolation , 2018, Neural Computation.

[275] Mats Hamrud,et al. Accelerating Extreme-Scale Numerical Weather Prediction , 2015, PPAM.

[276] Raj Jain,et al. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.

[277] Alessandro Bozzon,et al. Asking the right question in collaborative q&a systems , 2014, HT.

[278] Samuel Williams,et al. Optimization of geometric multigrid for emerging multi- and manycore processors , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[279] Benoît Meister,et al. Runnemede: An architecture for Ubiquitous High-Performance Computing , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[280] Song Han,et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[281] Sam Harbaugh,et al. Timing studies using a synthetic Whetstone benchmark , 1984, ALET.

[282] Eduardo F. Morales,et al. An Introduction to Reinforcement Learning , 2011 .

[283] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[284] Uri Shalit,et al. Learning Representations for Counterfactual Inference , 2016, ICML.

[285] Huiqian Niu,et al. An Implementation of ResNet on the Classification of RGB-D Images , 2019, Bench.

[286] Jared S. Murray,et al. Atlantic Causal Inference Conference (ACIC) Data Analysis Challenge 2017 , 2019, 1905.09515.

[287] Berin Martini,et al. NeuFlow: A runtime reconfigurable dataflow processor for vision , 2011, CVPR 2011 WORKSHOPS.

[288] B. Fornberg. Generation of finite difference formulas on arbitrarily spaced grids , 1988 .