Memory Demands in Disaggregated HPC: How Accurate Do We Need to Be?

Disaggregated memory has recently been proposed as a way to enable flexible, fine-grained allocation of memory capacity, mitigating the mismatch between fixed per-node resource provisioning and the actual needs of submitted jobs. By allowing memory capacity to be shared among cluster nodes, overall HPC system throughput can be improved, since stranded and underutilized resources are reduced. A key parameter, generally expected to be provided by the user at submission time, is the job's memory capacity demand, and it is unrealistic to expect this figure to be precise. This paper takes an important step towards understanding the effect of overestimating job memory requirements. We analyse the implications for overall system throughput and job response time, leveraging a disaggregated-memory simulation infrastructure built on top of the popular Slurm resource manager. Our results show that even when a 60% overestimation of a job's memory demand increases that job's response time by only 8%, the aggregate effect of every user overestimating in this way can be a 25% reduction in throughput and a fivefold increase in response time. These results show that GB-hours should be explicitly allocated in addition to core-hours.