"MAMA!": a memory allocator for multithreaded architectures

While the high-performance computing world is dominated by distributed-memory computer systems, applications that require random access into large shared data structures continue to motivate development of ever larger shared-memory parallel computers such as Cray's MTA and SGI's Altix systems. To support scalable application performance on such architectures, the memory allocator must be able to satisfy requests at a rate proportional to system size. For example, a 40-processor Cray MTA-2 can experience over 5000 concurrent requests, one from each of its 128 streams per processor. Cray's Eldorado, to be built upon the same network as Sandia's 10,000-processor Red Storm system, will sport thousands of multithreaded processors, leading to hundreds of thousands of concurrent requests.

In this paper, we present MAMA, a scalable shared-memory allocator designed to service any rate of concurrent requests. MAMA is distinguished from prior work on shared-memory allocators in that it employs software combining to aggregate requests serviced by a single heap structure; Hoard and MTA malloc, by contrast, require repetition of the underlying heap data structures in proportion to processor count. Unlike Hoard, MAMA does not exploit processor-local data structures, which limits its applicability today to systems that sustain high utilization in the presence of global references, such as Cray's MTA systems. We believe MAMA's relevance to other shared-memory systems will grow as they become increasingly multithreaded and, consequently, more tolerant of references to non-local memory.

We show not only that MAMA scales on Cray MTA systems, but also that it delivers absolute performance competitive with allocators employing heap repetition. In addition, we demonstrate that performance of repetition-based allocators does not scale under heavy loads.
We also argue more generally that methods using repetition alone to support concurrency are subject to an impractical tradeoff of scalability against space consumption: when scaled up to meet increasing concurrency demands, repetition-based allocators necessarily house unused space that grows as p^2, that is, quadratically in the number of processors p. Hierarchical structure may reduce this to p log p, but in building large-scale shared-memory parallel computers, unused memory more than linear in p is unacceptable. MAMA, in contrast, scales to arbitrarily large systems while consuming memory that increases only linearly with system and request size.

MAMA is of both theoretical interest, for its use of novel algorithmic techniques, and practical importance, as the concurrency upon which shared-memory performance depends continues to grow and increasingly latency-tolerant multithreaded architectures emerge. While our work is a very recent contribution to memory allocation technology, MAMA has already been incorporated into production as the cornerstone for global memory allocation in Cray's multithreaded systems.
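
As a rough illustration of the combining idea mentioned above, the following sketch shows how a batch of allocation requests might be aggregated into a single update of one shared heap. It is not MAMA's actual algorithm: the `BumpHeap` class, the `combined_allocate` function, and the bump-pointer heap itself are hypothetical names and simplifications chosen for illustration, and the batch is formed serially rather than by concurrent streams.

```python
from itertools import accumulate


class BumpHeap:
    """Toy single heap (an assumption for illustration): one counter
    hands out contiguous address ranges and never frees them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.next_free = 0
        self.updates = 0  # number of times the shared heap state was modified

    def reserve(self, nbytes):
        if self.next_free + nbytes > self.capacity:
            raise MemoryError("toy heap exhausted")
        base = self.next_free
        self.next_free += nbytes
        self.updates += 1
        return base


def combined_allocate(heap, sizes):
    """Serve a batch of requests with a single update to the shared heap."""
    if not sizes:
        return []
    # Exclusive prefix sums give each request its offset within the batch.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # One reservation covers the whole batch, however many requests it holds.
    base = heap.reserve(sum(sizes))
    return [base + off for off in offsets]


heap = BumpHeap(capacity=1024)
addresses = combined_allocate(heap, [16, 32, 8])
print(addresses, heap.updates)
```

In this sketch the shared counter is touched once per batch rather than once per request, which is the property that lets a single heap structure absorb a growing number of concurrent requesters. In a real multithreaded setting the batch would have to be assembled from requests arriving in parallel, a step this serial example does not attempt to model.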