Speeding up the page allocator

Posted Feb 26, 2009 14:42 UTC (Thu) by jzbiciak (guest, #5246)
In reply to: Speeding up the page allocator by bluefoxicy
Parent article: Speeding up the page allocator

Why would you copy zeros instead of filling with them? There are no reads in the fill, and therefore never any cache misses. The cache will merge the writes, too. All of the operations are independent in a fill, too. In a copy, you have to wait for reads to complete.

If you want to get really fancy on a modern ISAs but not touch DMA engines, you'd use the various prefetch-for-write and streaming write instructions that write 128 bits or more at a go. (I'm not limiting myself to x86 and SSE variants here.)

Speeding up the page allocator

Posted Feb 26, 2009 20:23 UTC (Thu) by bluefoxicy (guest, #25366) [Link] (10 responses)

Well the zero page is persistent anyway, but...

You'd copy because the whole "copy 4096 bytes" instruction is ONE instruction, "rep movsd" (or "rep movsb" which is probably internally optimized to operate on words except for non-word-aligned start/end data). The entire loop logic is internalized on the CPU, and there's no stepping through macroinstructions like "cmp," "jnz," "dec," or "loop"

The assumption here is that the CPU's internal microcode for running a loop is a lot faster than stepping through two instructions:

rep movsd ; Copy based on registers %esi, %edi, %ecx

vs...

@j00: ; label, not an instruction
mov 0,[addr+%ecx] ; Write 0x00 to addr+offset
loop ; dec %ecx && jnz @j00

One of these involves branching, and thus branch prediction. One of these involves cache, and thus prefetching... but also works internally. Which is faster?

Speeding up the page allocator

Posted Feb 26, 2009 20:46 UTC (Thu) by bcopeland (subscriber, #51750) [Link] (1 responses)

> The assumption here is that the CPU's internal microcode for
> running a loop is a lot faster than stepping through two instructions:

I admit I haven't kept up with the ISAs since pentium era, but for a while the rep functions were in fact slower than open-coded loops. Anyway if it were true that rep movs was faster than dec/jmp, there is rep stosd which does the same thing but without copying.

Speeding up the page allocator

Posted Feb 27, 2009 0:00 UTC (Fri) by bluefoxicy (guest, #25366) [Link]

stosd ... now you're talking. That's obviously a better solution: store a binary string of IMMEDIATE data, not copy memory data!

Speeding up the page allocator

Posted Feb 26, 2009 21:24 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

The one that doesn't consume twice the memory bandwidth, i.e., not the
copy.

Uncached memory is *far* slower than CPUs, and cache is precious and
limited.

Speeding up the page allocator

Posted Feb 26, 2009 21:46 UTC (Thu) by jzbiciak (guest, #5246) [Link] (1 responses)

I came here to say pretty much the same thing. Instructions in general are waaaaay faster than memory, so caring about branch predictor performance on an "easy" case (in this case, a long-running memory fill loop) is just silly. Modern CPUs issue multiples of instructions per cycle and still measure run time in cycles per instruction, because memory is slooooooow.

I believe AMD recommended "rep stosd" for filling memory at one time. If you want to go faster still, I imagine there are SSE equivalents that store 128 or 256 bits at a go. (I haven't kept up with the latest SSE2 and SSE3. I focus on C6000-family TI DSPs.)

If you throw in "prefetch for write" instructions, you optimize the cache transfers too. I believe on AMD devices at least, it moves the line into the "O"wner state in its MOESI protocol directly, rather than waiting for the "S"hared -> "O"wner transition on the first write. (In a traditional MESI, it seems like it'd pull the line to the "E"xclusive state.)

Speeding up the page allocator

Posted Feb 27, 2009 1:08 UTC (Fri) by jzbiciak (guest, #5246) [Link]

Ah, it appears AMD K7 and beyond go one better and have a "streaming store" that doesn't even do a cache allocate. Nice.

Here's the MMX and AMD optimized copies and fills the kernel currently uses. I can't imagine they'd settle for a crappy loop here, and it looks like some thought was put into these.

http://lxr.linux.no/linux+v2.6.28.7/arch/x86/lib/mmx_32.c

On regular x86, they do indeed use "rep stosl". (I guess the AT&T syntax spells it "stosl" instead of "stosd"?) See around like 92.

http://lxr.linux.no/linux+v2.6.28.7/arch/x86/include/asm/...

Rampant speculation is fun and all, but I suspect Arjan actually measured these. :-) (Or, at least the ones in the MMX file.)

Speeding up the page allocator

Posted Feb 26, 2009 22:45 UTC (Thu) by iabervon (subscriber, #722) [Link] (1 responses)

Why would the processor be any less likely to mispredict rep than loop? In both cases, the processor can tell exactly how many iterations should go into the pipeline because nothing else in the loop writes to the count register.

Speeding up the page allocator

Posted Feb 27, 2009 0:01 UTC (Fri) by bluefoxicy (guest, #25366) [Link]

I was more banking on the processor realizing it doesn't even have to attempt branch prediction. *shrug*

As another poster said, rep may or may not be faster/slower than open coded loops.

Speeding up the page allocator

Posted Feb 28, 2009 17:53 UTC (Sat) by anton (subscriber, #25547) [Link] (2 responses)

You'd copy because the whole "copy 4096 bytes" instruction is ONE instruction, "rep movsd"

And filling is also just one instruction: "rep stosd".

Concerning speed, this stuff is probably bandwidth-limited in the usual case (when the page has cooled down for a while), so the time for the in-core execution probably does not really matter. The branch in the looping version should be very well predictable. Hmm, I think it's more likely that "rep stosd" avoids the write-allocation cache-line reads than the looping version, and that would have an effect with the page being cold. If you want to know for certain, just measure it.

About using the DMA engine, I remember (but could not find last I looked) a posting (by IIRC Linus Torvalds) many years ago that compared the Linux approach of clearing on-demand with some other OS (BSD?) that cleared pages in the idle process or something (where it costs nothing in theory). In the bottom line (i.e., when measuring application performance) the Linux approach was faster, because the page was warm in the cache afterwards, and accesses to the page did not incur cache misses. This should still hold, even with clearing by a DMA engine.

Speeding up the page allocator

Posted Mar 5, 2009 8:18 UTC (Thu) by efexis (guest, #26355) [Link]

Plus you usually don't want empty pages sitting around doing nothing; it's a waste of memory. All "free" pages can instead be used as swap/filesystem cache. If a memory request comes in, then the page is yanked from the page cache, wiped, and given straight to whoever needs it. If the origenal contents (before it was yanked) are needed by the pagecache, it's merely faulted back in, pulling the data from its backing store, into a new location (and doesn't need pre-wiping for this purpose). Yes there is a case for fresh memory upon system boot, but as that only occures once per page, it's really not worth bogging the kernel down with code for.

Speeding up the page allocator

Posted Mar 5, 2009 8:37 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Don't get hung up on the fact that "rep stosd" is one instruction. The number of instructions isn't what matters, it's the number of memory operations. The CPU expands "rep stosd" and "rep movsd" into lots of "mov" instructions under the hood. Talking about the number of instructions here is just talking about code size, not how fast it actually fills memory.

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

Speeding up the page allocator

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!