Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, June 25-30, 2001, Boston, Massachusetts, USA

While the virtual memory management in Linux 2.2 has decent performance for many workloads, it suffers from a number of problems. The first part of this paper describes how the Linux 2.2 VMM works and analyses why it behaves badly in some situations. The second part of the paper describes how much of this behaviour has been fixed in the Linux 2.4 kernel. Because Linux 2.4 was in a code freeze period while these improvements were implemented, only known-good solutions have been integrated. Many of the ideas used are derived from principles used in other operating systems, mostly because we are certain that they work and understand why, which makes them suitable for integration into the Linux codebase during a code freeze.

1 Linux 2.2 memory management

The memory management in the Linux 2.2 kernel seems to be focused on simplicity and low overhead. While this works pretty well in practice for most systems, it has some weak points left and simply falls apart under some scenarios.

Memory in Linux is unified, that is, all the physical memory is on the same free list and can be allocated to any of the following memory pools on demand. Most of these pools can grow and shrink on demand. Typically most of a system's memory will be allocated to the data pages of processes and to the page and buffer caches.

• The slab cache: this is the kernel's dynamically allocated heap storage. This memory is unswappable, but once all objects within one (usually page-sized) area are unused, that area can be reclaimed.

• The page cache: this cache is used to cache file data for both mmap() and read() and is indexed by (inode, index) pairs. No dirty data exists in this cache; whenever a program writes to a page, the dirty data is copied to the buffer cache, from where the data is written back to disk.

• The buffer cache: this cache is indexed by (block device, block number) tuples and is used to cache raw disk devices, inodes, directories and other filesystem metadata. It is also used to perform disk IO on behalf of the page cache and the other caches. For disk reads the page cache bypasses this cache, and for network filesystems it isn't used at all.

• The inode cache: this cache resides in the slab cache and contains information about cached files in the system. Linux 2.2 cannot shrink this cache, but because of its limited size it does need to reclaim individual entries.

• The dentry cache: this cache contains directory and name information in a filesystem-independent way and is used to look up files and directories. This cache is dynamically grown and shrunk on demand.

• SYSV shared memory: the memory pool containing the SYSV shared memory segments is managed pretty much like the page cache, but has its own infrastructure for doing things.

• Process mapped virtual memory: this memory is administered in the process page tables. Processes can have page cache or SYSV shared memory segments mapped, in which case those pages are managed both in the page tables and in the data structures used for, respectively, the page cache or the shared memory code.

1.1 Linux 2.2 page replacement

The page replacement of Linux 2.2 works as follows. When free memory drops below a certain threshold, the pageout daemon (kswapd) is woken up. The pageout daemon should usually be able to keep enough memory free, but if it isn't, user programs will end up calling the pageout code themselves.
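As an illustration of this control flow, here is a minimal C sketch of an allocation path that wakes kswapd when free memory runs low and falls back to synchronous pageout when the daemon cannot keep up. All names, thresholds and numbers in it are invented for illustration; the real 2.2 allocator is considerably more involved.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy model of the allocation path described above.  The threshold,
     * the numbers and all helper names are invented for this sketch. */

    #define FREE_PAGES_LOW 256            /* hypothetical wakeup threshold */
    static unsigned long nr_free_pages = 200;

    static void wake_up_kswapd(void)
    {
        /* In the kernel this would wake the background pageout daemon. */
        printf("kswapd woken at %lu free pages\n", nr_free_pages);
    }

    static bool try_to_free_pages(void)
    {
        /* Stand-in for the synchronous pageout code that the allocating
         * process runs itself when kswapd cannot keep up. */
        nr_free_pages += 32;
        return true;
    }

    static void *alloc_from_freelist(void)
    {
        if (nr_free_pages == 0)
            return NULL;
        nr_free_pages--;
        return malloc(4096);              /* stand-in for a physical page */
    }

    static void *alloc_page(void)
    {
        /* Free memory below the threshold: reclaim in the background. */
        if (nr_free_pages < FREE_PAGES_LOW)
            wake_up_kswapd();

        void *page = alloc_from_freelist();
        if (page)
            return page;

        /* kswapd did not keep up, so this process pages out directly. */
        if (try_to_free_pages())
            page = alloc_from_freelist();
        return page;
    }

    int main(void)
    {
        void *p = alloc_page();
        printf("allocation %s\n", p ? "succeeded" : "failed");
        free(p);
        return 0;
    }

The point of the design is that pageout normally happens asynchronously in kswapd; processes only pay the cost themselves when memory pressure outruns the daemon.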
The main pageout loop is in the function try_to_free_pages, which starts by freeing unused slabs from the kernel memory pool. After that, it calls the following functions in a loop, asking each of them to scan a small part of its memory, until enough memory has been freed.

• shrink_mmap is a classical clock algorithm, which loops over all physical pages, clearing referenced bits, queueing old dirty pages for IO and freeing old clean pages. Its main disadvantage, however, is that it isn't able to free pages which are in use by a program or a shared memory segment; those pages need to be unmapped by swap_out first.

• shm_swap scans the SYSV shared memory segments, swapping out those pages that haven't been referenced recently and which aren't mapped into any process.

• swap_out scans the virtual memory of all processes in the system, unmapping pages which haven't been referenced recently, starting swapout IO and placing those pages in the page cache.

• shrink_dcache_memory reclaims entries from the VFS name cache. This is not directly reusable memory, but as soon as a whole page of these entries becomes unused we can reclaim that page.

Some balancing between these memory-freeing functions is achieved by calling them in a loop, starting off by asking each of them to scan a little bit of its memory; each function accepts a priority argument which tells it what fraction of its memory to scan. If not enough memory is freed in the first loop, the priority is increased and the functions are called again. The idea behind this scheme is that when one memory pool is heavily used, it will not give up its resources lightly and we will automatically fall through to one of the other memory pools. However, this scheme relies on each of the memory pools reacting in a similar way to the priority argument under different load conditions. This doesn't work out in practice because the memory pools simply have fundamentally different properties to begin with.

1.2 Problems with the Linux 2.2 page replacement

• Balancing between evicting pages from the file cache, evicting unused process pages and evicting pages from shm segments is fragile. If memory pressure is "just right", shrink_mmap is always successful in freeing cache pages, and a process which has been idle for a day is still in memory. This can even happen on a system with a fairly busy filesystem cache, but only with the right phase of the moon.

• Simple NRU (Not Recently Used) replacement cannot accurately identify the working set versus incidentally accessed pages and can lead to extra page faults. This doesn't hurt noticeably for most workloads, but it makes a big difference in some workloads and can be fixed easily, mostly since the LFU replacement used in older Linux kernels is known to work.

• Due to the simple clock algorithm in shrink_mmap, clean, accessed pages can sometimes get evicted before dirty, old pages. With a relatively small file cache that mostly consists of dirty data, e.g. when unpacking a tarball, it is possible for the dirty pages to evict the (clean) metadata buffers that are needed to write the dirty data to disk. A few other corner cases with amusing variations on this theme are bound to exist.

• The system reacts badly to variable VM load or to load spikes after a period of no VM activity. Since kswapd, the pageout daemon, only scans when the system is low on memory, the system can end up in a state where some pages have referenced bits from the last 5 seconds, while other pages have referenced bits from 20 minutes ago. This means that on a load spike the system has no clue which are the right pages to evict from memory; this can lead to a swapping storm, where the wrong pages are evicted and almost immediately afterwards faulted back in, leading to the pageout of another random page, and so on.

• Under very heavy loads, NRU replacement of pages simply doesn't cut it. More careful and better balanced pageout eviction and flushing is called for, but with the fragility of the Linux 2.2 pageout framework this goal doesn't really seem achievable.

The fact that shrink_mmap is a simple clock algorithm and relies on other functions to make process-mapped pages freeable makes it fairly unpredictable. Add to that the balancing loop in try_to_free_pages and you get a VM subsystem which is extremely sensitive to minute changes in the code, and a fragile beast at best when it comes to maintenance or (shudder) tweaking.
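To make the shape of this balancing loop concrete, here is a minimal, self-contained sketch modelled on the description above. The four function names come from the text, but their signatures, pool sizes and freeing behaviour are invented stubs; only the overall structure, a priority value handed to every pool and escalated until enough pages are freed, follows the description. (The numeric convention in the sketch assumes the 2.2 style, where the priority value counts down and a lower value means a larger scan.)

    #include <stdio.h>

    #define TARGET_FREED 32     /* hypothetical pageout goal, in pages */

    /* Stub for any of the four memory-freeing functions: scan a fraction
     * of the pool that grows as the priority value drops, and pretend
     * one in four scanned pages turns out to be freeable. */
    static int scan_pool(const char *name, int pool_pages, int priority)
    {
        int scanned = pool_pages >> priority;
        int freed = scanned / 4;
        printf("%-22s prio %d: scanned %4d, freed %3d\n",
               name, priority, scanned, freed);
        return freed;
    }

    static int shrink_mmap(int prio)          { return scan_pool("shrink_mmap", 2048, prio); }
    static int shm_swap(int prio)             { return scan_pool("shm_swap", 256, prio); }
    static int swap_out(int prio)             { return scan_pool("swap_out", 1024, prio); }
    static int shrink_dcache_memory(int prio) { return scan_pool("shrink_dcache_memory", 128, prio); }

    /* The shape of the balancing loop: ask every pool to scan a little,
     * and escalate until enough memory has been freed.  The scheme only
     * balances well if every pool responds to the priority argument in
     * roughly the same way, which, as described above, they do not. */
    static int try_to_free_pages_sketch(void)
    {
        int freed = 0;

        for (int priority = 6; priority >= 0; priority--) {
            freed += shrink_mmap(priority);
            freed += shm_swap(priority);
            freed += swap_out(priority);
            freed += shrink_dcache_memory(priority);
            if (freed >= TARGET_FREED)
                break;          /* freed enough; stop scanning */
        }
        return freed;
    }

    int main(void)
    {
        printf("total freed: %d pages\n", try_to_free_pages_sketch());
        return 0;
    }

In this stub every pool responds identically, so the escalation converges smoothly. In the real kernel shrink_mmap skips mapped pages entirely while swap_out merely unmaps them, so the same priority value puts very different pressure on different pools, which is exactly the imbalance described above.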
2 Changes in Linux 2.4

For Linux 2.4 a substantial development effort has gone into things like making the VM subsystem fully fine-grained for SMP systems and supporting machines with more than 1GB of RAM. Changes to the pageout code were made only in the last phase of development and are, because of that, somewhat conservative in nature, employing only known-good methods to deal with the problems that occurred in the page replacement of the Linux 2.2 kernel. Before we get to the page replacement changes, however, here is a short overview of the other changes in the 2.4 VM:

• More fine-grained SMP locking. The scalability of the VM subsystem has improved a lot for workloads where multiple CPUs are reading or writing the same file simultaneously, for example web or FTP server workloads. This has no real influence on the page replacement code.

• Unification of the buffer cache and the page cache. While in Linux 2.2 the page cache used the buffer cache to write back its data, needing an extra copy of the data and doubling memory requirements for some write loads, in Linux 2.4 dirty page cache pages are simply added to both the buffer and the page cache, and the system does disk IO directly to and from the page cache page. The buffer cache is still maintained separately for filesystem metadata and the caching of raw block devices. Note that the cache was already unified for reads in Linux 2.2; Linux 2.4 just completes the unification.

• Support for systems with up to 64GB of RAM (on x86). The Linux kernel previously had all physical memory directly mapped in the kernel's virtual address space, which limited the amount of supported memory to slightly under 1GB. For Linux 2.4 the kernel also supports additional memory (so-called "high memory" or highmem), which cannot be used for kernel data structures but only for page cache and user process memory. To do IO on these pages, they are temporarily mapped into kernel virtual memory and the data is copied to or from a bounce buffer in "low memory". At the same time the memory zone for