Benchmarking Next Generation Hardware Platforms: An Experimental Approach

Heterogeneous multi-cores, platforms comprising both general-purpose and accelerator cores, are becoming increasingly common. Further, with processor designs in which there are many cores on a chip, a recent trend is to include functional and performance asymmetries to balance power usage against performance requirements. Coupled with this trend in CPUs is the development of high-end interconnects providing low-latency, high-throughput communication. Understanding the utility of such next generation platforms for future datacenter workloads requires investigations that evaluate the combined effects on workloads of (1) processing units, (2) interconnect, and (3) usage models. For benchmarks, then, this requires functionality that makes it possible to easily yet separately vary the benchmark attributes that affect the performance observed for application-relevant metrics like throughput, end-to-end latency, and the effects on both due to the presence of other concurrently running applications. To obtain these properties, benchmarks must be designed to test different and varying, rather than fixed, combinations of factors pertaining to their processing and communication behavior and their respective usage patterns (e.g., degree of burstiness). The ‘Nectere’ benchmarking framework is intended for understanding and evaluating next generation multicore platforms under varying workload conditions. This paper demonstrates two specific benchmarks constructed with Nectere: (1) a financial benchmark posing low-latency challenges, and (2) an image processing benchmark with high throughput expectations. Benchmark characteristics can be varied along dimensions that include their relative usage of heterogeneous processors, like CPUs vs. graphics processors (GPUs), and their use of the interconnect through variations in data sizes and communication rates.
With Nectere, one can create a mix of workloads to study the effects of consolidation, and one can create both single- and multi-node versions of these benchmarks. Results presented in the paper evaluate the ability or inability of workloads to share resources like GPUs or network interconnects, and the effects of such sharing on applications running in consolidated systems.
