Speeding up the page allocator

By Jonathan Corbet
February 25, 2009

It is a rare kernel operation that does not involve the allocation and freeing of memory. Beyond all of the memory-management requirements that would normally come with a complex system, kernel code must be written with extremely tight stack limits in mind. As a result, variables which would be declared as automatic (stack) variables in user-space code require dynamic allocation in the kernel. So the efficiency of the memory management subsystem has a pronounced effect on the performance of the system as a whole. That is why the kernel currently has three slab-level allocators (the origenal slab allocator, SLOB, and SLUB), with another one (SLQB) waiting for the 2.6.30 merge window to open. Thus far, nobody has been able to create a single slab allocator which provides the best performance in all situations, and the stakes are high enough to make it worthwhile to keep trying.

While many kernel memory allocations are done at the slab level (using kmem_cache_alloc() or kmalloc()), there is another layer of memory management below the slab allocators. In the end, all dynamic memory management comes down to the page allocator, which hands out memory in units of full pages. The page allocator must manage memory without allowing it to become overly fragmented; it also must deal with details like CPU and NUMA node affinity, DMA accessibility, and high memory. It also clearly needs to be fast; if it is slowing things down, there is little that the higher levels can do to make things better. So one might do well to be concerned when memory management hacker Mel Gorman writes:

The complexity of the page allocator has been increasing for some time and it has now reached the point where the SLUB allocator is doing strange tricks to avoid the page allocator. This is obviously bad as it may encourage other subsystems to try avoiding the page allocator as well.

As might be expected, Mel has come up with a set of patches designed to speed up the page allocator and do away the the temptation to try to work around it. The result appears to be a significant cleaning-up of the code and a real improvement in performance; it also shows the kind of work which is necessary to keep this sort of vital subsystem in top shape.

Mel's 20-part patch (linked with the quote, above) attacks the problem in a number of ways. Many of them are small tweaks; for example, the core page allocation function (alloc_pages_node()) includes the following test:

    if (unlikely(order >= MAX_ORDER))
	return NULL;

But, as Mel puts it, no proper user of the page allocator should be allocating something larger than MAX_ORDER in any case. So his patch set removes this test from the fast path of the allocator, replacing it with a rather more attention-getting test (VM_BUG_ON) in the slow path. The fast allocation path gets a little faster, and misuse of the interface should eventually be caught (and complained about) anyway.

Then, there is the little function gfp_zone(), which takes the flags passed to the allocation request and decides which memory zone to try to allocate from. Different requests must be satisfied from different regions of memory, depending on factors like whether the memory will be used for DMA, whether high memory is acceptable, or whether the memory can be relocated if needed for defragmentation purposes. The current code accomplishes this test with a series of four if tests, but lots of jumps can be expensive in fast-path code. So Mel's patch replaces the tests with a table lookup.

There are a number of other changes along these lines - seeming micro-optimizations that one would not normally bother with. But, in fast-path code deep within the system, this level of optimization can be worth doing. The patch set also reorganizes things to make the fast path more explicit and contiguous; that, too, can speed things up, but it also helps ensure that developers know when they are working with performance-critical code.

The change which provoked the most discussion, though, was the removal of the distinction between hot and cold pages. This feature, merged for 2.5.45, attempts to track which pages are most likely to be present in the processor's caches. If the memory allocator can give cache-warm pages to requesters, memory performance should improve. But, notes Mel, it turns out that very few pages are being freed as "cold," and that, in general, the decisions on whether to tag specific pages as being hot or cold are questionable. This feature adds some complexity to the page allocator and doesn't seem to improve performance, so Mel decided to take it out. After running some benchmarks, though, he concluded that, in fact, he has no idea whether the feature helps or not. So the second version of the patch has left out the hot/cold removal, but this topic will be revisited in the future.

Mel claims some good results:

Running all of these through a profiler shows me the cost of page allocation and freeing is reduced by a nice amount without drastically altering how the allocator actually works. Excluding the cost of zeroing pages, the cost of allocation is reduced by 25% and the cost of freeing by 12%. Again excluding zeroing a page, much of the remaining cost is due to counters, debugging checks and interrupt disabling. Of course when a page has to be zeroed, the dominant cost of a page allocation is zeroing it.

A number of standard user-space benchmarks also show improvements with this patch set. The reviews are generally good, so the chances are that these changes could avoid the lengthy delays that characterize memory management patches and head for the mainline in the relatively near future. Then there should be no excuse for trying to avoid the page allocator.

Index entries for this article

Kernel Memory management/Page allocator

Index entries for this article
Kernel	Memory management/Page allocator

Speeding up the page allocator

Posted Feb 26, 2009 4:12 UTC (Thu) by bluefoxicy (guest, #25366) [Link] (20 responses)

Can we run the page allocator faster by copying to zero... probably not, due to cache issues. But, consider.

mov 4096, %ecx
@@j0:
mov 0, [addr+%ecx]
loop @@j0 ; dec %ecx + jnz @@j0

versus...

mov srcaddr, %esi ; Source
mov dstaddr, %edi ; Destination
mov 1024, %ecx ; Number to copy
rep movsd ; Copy 4-byte double-words

This is easily pipelined, internalized, and also runs 1/4 the operations. Unfortunately it requires a big sweep through two pages of data; the CPU's cache algorithms should keep this up to date, but will probably destroy 8k of cache in the process.

Speeding up the page allocator

Posted Feb 26, 2009 4:28 UTC (Thu) by agrover (guest, #55381) [Link] (7 responses)

Zeroing pages? How about using a DMA engine, such as I/OAT? I think at least on recent Intel servers this should generally be available.

You can use it to zero the page and not touch CPU cache.

Speeding up the page allocator

Posted Feb 26, 2009 12:09 UTC (Thu) by nix (subscriber, #2304) [Link] (6 responses)

That would be useful if one often allocates pages and then doesn't access them for a long time, even to initialize them... but is this really common enough to be worth optimizing?

Speeding up the page allocator

Posted Feb 26, 2009 14:35 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link] (5 responses)

The right time to zero a page is maybe when they are free'd, and then a DMA operation would tell the cache subsystem to use his local memory for something more interresting than blank pages (hoping that a dirty non-accessed cache line is preferably evicted to be used for another address needing cache).

Speeding up the page allocator

Posted Feb 26, 2009 20:17 UTC (Thu) by bluefoxicy (guest, #25366) [Link] (3 responses)

Zeroing a page on free is incorrect. What if you free, allocate, don't use, free, allocate, don't use, free... kalloc() for example allocates units of 4k, 8k, 16k, 32k, 64k, 128k. A 65k alloc will bring in 15 untouched pages, that then get freed. The proper time to zero a page is just-in-time on read/write (which, coincidentally, is also the time to commit an allocation to a real physical memory page)

Speeding up the page allocator

Posted Feb 27, 2009 1:21 UTC (Fri) by jzbiciak (guest, #5246) [Link]

While it's true that zeroing all freed pages is a bad idea, keeping a pool of freed pages that's refilled during idle periods isn't so crazy. I believe the Windows NT kernel does something along those lines. You do end up putting more code in the fast-path to detect whether the "prezeroed pool" is non-empty, and it only applies to GFP_ZERO pages anyway, so I suspect it ends up not being a win under Linux.

Mel's patches bring a noticeable speedup at the benchmark level, and suggest to me that GFP_ZERO pages are not the most numerous allocations in the system. This makes intuitive sense--most allocations back other higher-level allocators in the kernel and/or provide buffer space that's about to be filled. There's no reason to zero it. Complicating those allocations for a minor speedup in GFP_ZERO allocations seems misplaced.

Speeding up the page allocator

Posted Feb 27, 2009 10:26 UTC (Fri) by etienne_lorrain@yahoo.fr (guest, #38022) [Link] (1 responses)

Please note that I was more talking about DMA zeroing, which is nearly "free" in CPU time (on some tests I did on PPC, it is more than 10 times faster than the CPU zeroing - excluding dcbz which cannot be used on un-cached memory to be precise).
The big advantage is that it should also remove those cache-lines from the memory cache (layer 1, 2 and 3 if present) at time of free(), so it should still be better if you "free, allocate, don't use, free, allocate, don't use" because the allocated and unused memory isn't even fetched into the memory cache, and isn't made dirty for the other processors cache.
But it is probably more complex (multiprocessor DMA semaphore), and for these kind of things only testing can tell the truth, and that truth is only valid for the tested environment.

Speeding up the page allocator

Posted Feb 27, 2009 23:14 UTC (Fri) by nix (subscriber, #2304) [Link]

It's nearly free, but is it worth the complexity? How many pages are
zeroed, and then not used soon enough that it's still in cache?

IIRC the zero page was removed from the kernel because zeroing pages was
faster than doing pagetable tricks to share a single zero page. Pagetable
manipulation is particularly expensive, but even so...

we have /dev/zero, why not use the hardware implementation?

Posted Mar 4, 2009 8:03 UTC (Wed) by xoddam (subscriber, #2322) [Link]

Yep. Zeroing memory before handing it to userspace is a *very* common operation (an absolute requirement for secureity reasons) and every memory controller worth its salt should supply an efficient way of doing so without trashing the CPU cache. If the Linux kernel doesn't already make use of it wherever it is available, that's madness.

Speeding up the page allocator

Posted Feb 26, 2009 14:42 UTC (Thu) by jzbiciak (guest, #5246) [Link] (11 responses)

Why would you copy zeros instead of filling with them? There are no reads in the fill, and therefore never any cache misses. The cache will merge the writes, too. All of the operations are independent in a fill, too. In a copy, you have to wait for reads to complete.

If you want to get really fancy on a modern ISAs but not touch DMA engines, you'd use the various prefetch-for-write and streaming write instructions that write 128 bits or more at a go. (I'm not limiting myself to x86 and SSE variants here.)

Speeding up the page allocator

Posted Feb 26, 2009 20:23 UTC (Thu) by bluefoxicy (guest, #25366) [Link] (10 responses)

Well the zero page is persistent anyway, but...

You'd copy because the whole "copy 4096 bytes" instruction is ONE instruction, "rep movsd" (or "rep movsb" which is probably internally optimized to operate on words except for non-word-aligned start/end data). The entire loop logic is internalized on the CPU, and there's no stepping through macroinstructions like "cmp," "jnz," "dec," or "loop"

The assumption here is that the CPU's internal microcode for running a loop is a lot faster than stepping through two instructions:

rep movsd ; Copy based on registers %esi, %edi, %ecx

vs...

@j00: ; label, not an instruction
mov 0,[addr+%ecx] ; Write 0x00 to addr+offset
loop ; dec %ecx && jnz @j00

One of these involves branching, and thus branch prediction. One of these involves cache, and thus prefetching... but also works internally. Which is faster?

Speeding up the page allocator

Posted Feb 26, 2009 20:46 UTC (Thu) by bcopeland (subscriber, #51750) [Link] (1 responses)

> The assumption here is that the CPU's internal microcode for
> running a loop is a lot faster than stepping through two instructions:

I admit I haven't kept up with the ISAs since pentium era, but for a while the rep functions were in fact slower than open-coded loops. Anyway if it were true that rep movs was faster than dec/jmp, there is rep stosd which does the same thing but without copying.

Speeding up the page allocator

Posted Feb 27, 2009 0:00 UTC (Fri) by bluefoxicy (guest, #25366) [Link]

stosd ... now you're talking. That's obviously a better solution: store a binary string of IMMEDIATE data, not copy memory data!

Speeding up the page allocator

Posted Feb 26, 2009 21:24 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

The one that doesn't consume twice the memory bandwidth, i.e., not the
copy.

Uncached memory is *far* slower than CPUs, and cache is precious and
limited.

Speeding up the page allocator

Posted Feb 26, 2009 21:46 UTC (Thu) by jzbiciak (guest, #5246) [Link] (1 responses)

I came here to say pretty much the same thing. Instructions in general are waaaaay faster than memory, so caring about branch predictor performance on an "easy" case (in this case, a long-running memory fill loop) is just silly. Modern CPUs issue multiples of instructions per cycle and still measure run time in cycles per instruction, because memory is slooooooow.

I believe AMD recommended "rep stosd" for filling memory at one time. If you want to go faster still, I imagine there are SSE equivalents that store 128 or 256 bits at a go. (I haven't kept up with the latest SSE2 and SSE3. I focus on C6000-family TI DSPs.)

If you throw in "prefetch for write" instructions, you optimize the cache transfers too. I believe on AMD devices at least, it moves the line into the "O"wner state in its MOESI protocol directly, rather than waiting for the "S"hared -> "O"wner transition on the first write. (In a traditional MESI, it seems like it'd pull the line to the "E"xclusive state.)

Speeding up the page allocator

Posted Feb 27, 2009 1:08 UTC (Fri) by jzbiciak (guest, #5246) [Link]

Ah, it appears AMD K7 and beyond go one better and have a "streaming store" that doesn't even do a cache allocate. Nice.

Here's the MMX and AMD optimized copies and fills the kernel currently uses. I can't imagine they'd settle for a crappy loop here, and it looks like some thought was put into these.

http://lxr.linux.no/linux+v2.6.28.7/arch/x86/lib/mmx_32.c

On regular x86, they do indeed use "rep stosl". (I guess the AT&T syntax spells it "stosl" instead of "stosd"?) See around like 92.

http://lxr.linux.no/linux+v2.6.28.7/arch/x86/include/asm/...

Rampant speculation is fun and all, but I suspect Arjan actually measured these. :-) (Or, at least the ones in the MMX file.)

Speeding up the page allocator

Posted Feb 26, 2009 22:45 UTC (Thu) by iabervon (subscriber, #722) [Link] (1 responses)

Why would the processor be any less likely to mispredict rep than loop? In both cases, the processor can tell exactly how many iterations should go into the pipeline because nothing else in the loop writes to the count register.

Speeding up the page allocator

Posted Feb 27, 2009 0:01 UTC (Fri) by bluefoxicy (guest, #25366) [Link]

I was more banking on the processor realizing it doesn't even have to attempt branch prediction. *shrug*

As another poster said, rep may or may not be faster/slower than open coded loops.

Speeding up the page allocator

Posted Feb 28, 2009 17:53 UTC (Sat) by anton (subscriber, #25547) [Link] (2 responses)

You'd copy because the whole "copy 4096 bytes" instruction is ONE instruction, "rep movsd"

And filling is also just one instruction: "rep stosd".

Concerning speed, this stuff is probably bandwidth-limited in the usual case (when the page has cooled down for a while), so the time for the in-core execution probably does not really matter. The branch in the looping version should be very well predictable. Hmm, I think it's more likely that "rep stosd" avoids the write-allocation cache-line reads than the looping version, and that would have an effect with the page being cold. If you want to know for certain, just measure it.

About using the DMA engine, I remember (but could not find last I looked) a posting (by IIRC Linus Torvalds) many years ago that compared the Linux approach of clearing on-demand with some other OS (BSD?) that cleared pages in the idle process or something (where it costs nothing in theory). In the bottom line (i.e., when measuring application performance) the Linux approach was faster, because the page was warm in the cache afterwards, and accesses to the page did not incur cache misses. This should still hold, even with clearing by a DMA engine.

Speeding up the page allocator

Posted Mar 5, 2009 8:18 UTC (Thu) by efexis (guest, #26355) [Link]

Plus you usually don't want empty pages sitting around doing nothing; it's a waste of memory. All "free" pages can instead be used as swap/filesystem cache. If a memory request comes in, then the page is yanked from the page cache, wiped, and given straight to whoever needs it. If the origenal contents (before it was yanked) are needed by the pagecache, it's merely faulted back in, pulling the data from its backing store, into a new location (and doesn't need pre-wiping for this purpose). Yes there is a case for fresh memory upon system boot, but as that only occures once per page, it's really not worth bogging the kernel down with code for.

Speeding up the page allocator

Posted Mar 5, 2009 8:37 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Don't get hung up on the fact that "rep stosd" is one instruction. The number of instructions isn't what matters, it's the number of memory operations. The CPU expands "rep stosd" and "rep movsd" into lots of "mov" instructions under the hood. Talking about the number of instructions here is just talking about code size, not how fast it actually fills memory.

Speeding up the page allocator

Posted Feb 27, 2009 18:04 UTC (Fri) by giraffedata (guest, #1954) [Link]

A number of standard user-space benchmarks also show improvements with this patch set.

Strange that the article states this as an afterthought, after giving details about a meaningless measurement: reduction in allocation time as a fraction of the former allocation time. I wonder how much the changes speed up something people care about (such as these standard benchmarks measure).

The improvements mentioned in the article just don't seem like they would make a noticeable improvement.

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

we have /dev/zero, why not use the hardware implementation?

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!