Allocation wall: a limiting factor of Java applications on emerging multi-core platforms

Multi-core processors are widely used in computer systems. As the performance of microprocessors greatly exceeds that of memory, the memory wall becomes a limiting factor. It is important to understand how the large speed disparity between processor and memory influences the performance and scalability of Java applications on emerging multi-core platforms. In this paper, we study two popular Java benchmarks, SPECjbb2005 and SPECjvm2008, on multi-core platforms including Intel Clovertown and AMD Phenom. We focus on the "partially scalable" benchmark programs. With a small number of CPU cores these programs scale perfectly, but when more cores and software threads are used, the slope of the scalability curve degrades dramatically. In our experiments with these partially scalable programs, we identify a strong correlation between scalability, object allocation rate, and memory bus write traffic. We find that these applications allocate large amounts of memory and consume almost all of the memory write bandwidth on our hardware platforms. Because the write bandwidth is so limited, we propose the following hypothesis: for object-allocation-intensive Java applications on emerging multi-core platforms, scalability and performance are limited by object allocation, as if these applications were running into an "allocation wall". To verify this hypothesis, we perform several experiments, including measuring key architecture-level metrics, composing a micro-benchmark program, and studying the effect of modifying some of the "partially scalable" programs. All of the experiments strongly suggest the existence of the allocation wall.
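The abstract mentions a micro-benchmark but does not describe it. As a rough illustration only, an allocation-intensive workload of the kind discussed might be sketched as follows. The class name `AllocationWallSketch`, the method `run`, and all parameter values are hypothetical and are not taken from the paper. Each thread allocates short-lived byte arrays in a tight loop, and `main` reports the aggregate allocation rate as the thread count doubles. Note that a JIT compiler may remove some allocations through escape analysis, so a serious measurement would need a proper benchmark harness and hardware counters for memory bus traffic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class AllocationWallSketch {

    // Allocates short-lived byte arrays on `threads` threads and returns the
    // total number of payload bytes requested. Assumes objectBytes >= 1.
    static long run(int threads, int allocationsPerThread, int objectBytes) {
        AtomicLong totalBytes = new AtomicLong();
        AtomicLong sink = new AtomicLong(); // consumes results so the loop is not trivially dead
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                long localBytes = 0;
                long localSink = 0;
                for (int i = 0; i < allocationsPerThread; i++) {
                    byte[] garbage = new byte[objectBytes]; // becomes garbage immediately
                    garbage[0] = (byte) i;                  // touch the new object
                    localSink += garbage[0];
                    localBytes += garbage.length;
                }
                totalBytes.addAndGet(localBytes);
                sink.addAndGet(localSink);
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        if (sink.get() == Long.MIN_VALUE) {
            System.out.println("unexpected sink value");
        }
        return totalBytes.get();
    }

    public static void main(String[] args) {
        // Hypothetical settings: 64-byte objects, 2 million allocations per thread.
        int maxThreads = Runtime.getRuntime().availableProcessors();
        for (int threads = 1; threads <= maxThreads; threads *= 2) {
            long start = System.nanoTime();
            long bytes = run(threads, 2_000_000, 64);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("threads=%d  allocation rate=%.1f MB/s%n",
                    threads, bytes / 1e6 / seconds);
        }
    }
}
```

If the aggregate allocation rate stops growing as threads are added while the cores remain busy, that is consistent with the behavior the abstract calls "partially scalable", although this sketch alone cannot attribute the cause to write bandwidth.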
