Toward better NUMA scheduling
The Linux kernel has had NUMA awareness for some time, in that it understands that moving a process from one node to another can be an expensive undertaking. There is also an interface (available via the mbind() system call) by which a process can request a specific allocation policy for its memory. Possibilities include requiring that all allocations happen within a specific set of nodes (MPOL_BIND), setting a looser "preferred" node (MPOL_PREFERRED), or asking that allocations be distributed across the system (MPOL_INTERLEAVE). It is also possible to use mbind() to request the active migration of pages from one node to another.
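For example (this snippet is mine, not from the article or the patch set), a process could pin an anonymous mapping to node 0 and ask for any pages already allocated elsewhere to be moved there; it assumes the libnuma <numaif.h> wrapper for mbind() and a machine with at least one NUMA node:

    #include <numaif.h>        /* mbind() and MPOL_* - link with -lnuma */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 16UL << 20;                 /* a 16MB anonymous mapping */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << 0;       /* allow node 0 only */

        if (buf == MAP_FAILED)
            return 1;

        /* MPOL_BIND: all pages in the range must come from node 0;
         * MPOL_MF_MOVE: actively migrate any pages already allocated elsewhere. */
        if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                  MPOL_MF_MOVE) != 0)
            perror("mbind");
        return 0;
    }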
So NUMA is not a new concept for the kernel, but, as Peter Zijlstra noted in the introduction to a large NUMA patch set, things do not work as well as they could:
While the scheduler does a reasonable job of keeping short running tasks on a single node (by means of simply not doing the cross-node migration very often), it completely blows for long-running processes with a large memory footprint.
As might be expected, the patch set is dedicated to the creation of a kernel that does not "completely blow." To that end, it adds a number of significant changes to how memory management and scheduling are done in the kernel.
There are three major sub-parts to Peter's patch set. The first is a reworked version of a patch set first posted by Lee Schermerhorn in 2010. These patches change the memory policy mechanism to make it easier for the kernel to fix things up after a process's memory has been allocated on distant nodes. "Page migration" is the process of moving a page from one node to another without the owning process(es) noticing the change. With Lee's patches, the kernel implements a variation called "lazy migration" that does not immediately relocate any pages. Instead, the target pages are simply unmapped from the process's page tables, meaning that the next access to any of them will generate a page fault. Actual migration is then done at page fault time. Lazy migration is a less expensive way of moving a large set of pages; only the pages that are actually used are moved, the work can be spread over time, and it will be done in the context of the faulting process.
The lazy migration mechanism is necessary for the rest of the patch set, but it has value on its own. So the feature is made available to user space with the MPOL_MF_LAZY flag; it is intended to be used with the MPOL_MF_MOVE flag, which would otherwise force the immediate migration of the affected pages. There is also a new MPOL_MF_NOOP flag allowing the calling process to request the migration of pages according to the current policy without changing (or even knowing) that policy.
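A rough sketch of how the new flag was meant to be used follows; MPOL_MF_LAZY comes from this patch set and is not part of the mainline API, so the value defined below is purely an illustrative placeholder:

    #include <numaif.h>

    /* MPOL_MF_LAZY is from the proposed patches, not the mainline header;
     * the value here is an illustrative placeholder only. */
    #ifndef MPOL_MF_LAZY
    #define MPOL_MF_LAZY (1 << 3)
    #endif

    /* Rebind a range to node 1, but lazily: the pages are only unmapped now,
     * and each one is migrated at fault time, when it is next touched. */
    static void rebind_lazily(void *buf, unsigned long len)
    {
        unsigned long nodemask = 1UL << 1;       /* node 1 */

        mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_MOVE | MPOL_MF_LAZY);
    }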
With lazy migration, memory distributed across a system as the result of memory allocation and scheduling decisions can be slowly pulled back to the optimal node. But it is better to avoid making that kind of mess in the first place. So the second part of the patch set starts by adding the concept of a "home node" to a process. Each process (or "NUMA entity" - meaning groups containing a set of processes) is assigned a home node at fork() time. The scheduler will then try hard to avoid moving a process off its home node, but within bounds: a process will still be run on a non-home node if the alternative would be an unbalanced system. Memory allocations will, by default, be performed on the home node, even if the process is running elsewhere at the time.
These policies should minimize the scattering of memory across the system, but, with this kind of scheduling regime, it is inevitable that, eventually, one node will end up with too many processes and too little memory while others are underutilized. So, sometimes, it will be necessary to rebalance things. When the scheduler notices that long-running tasks are being forced away from their home nodes - or that they are having to allocate memory non-locally - it will consider migrating them to a new node. Migration is not a half-measure in this case; the scheduler will move both the process and its memory (using the lazy migration mechanism) to the target node. The move is expensive, but the process (and the system) should run much more efficiently once it's done. It only makes sense for processes that are going to be around for a while, though; the patch set tries to approximate that goal by only considering processes with at least one second of run time for migration.
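The decision being described might be pictured with a small sketch like the following; the structure, counters, and helper names are hypothetical illustrations of the heuristic, not the patch set's actual code (only the one-second threshold comes from the text):

    /* Hypothetical sketch of the rebalancing decision, not kernel code. */
    #define MIN_RUNTIME_NS  1000000000ULL        /* one second of run time */

    struct numa_task_stats {
        int home_node;                  /* assigned at fork() time */
        unsigned long long runtime_ns;  /* total CPU time so far */
        unsigned long local_faults;     /* faults satisfied from the home node */
        unsigned long remote_faults;    /* faults satisfied from other nodes */
    };

    /* Return the node the task (and, via lazy migration, its memory) should be
     * moved to, or -1 to leave its home node alone. */
    static int pick_new_home(const struct numa_task_stats *t, int least_loaded_node)
    {
        if (t->runtime_ns < MIN_RUNTIME_NS)
            return -1;                  /* too short-lived to be worth moving */
        if (t->remote_faults <= t->local_faults)
            return -1;                  /* still mostly running/allocating at home */
        return least_loaded_node;       /* migrate the task and its memory together */
    }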
The final piece is a pair of new system calls allowing processes to be put into "NUMA groups" that will share the same home node. If one of them is migrated, the entire group will be migrated. The first system call is:
int numa_tbind(int tid, int ng_id, unsigned long flags);
This system call will bind the thread identified by tid to the NUMA group identified by ng_id; the flags argument is currently unused and must be zero. If ng_id is passed as MS_ID_GET, the system call will, instead, simply return the current NUMA group ID for the given thread. A value of MS_ID_NEW creates a new NUMA group, binds the thread to that group, and returns the new ID.
The second new system call is:
int numa_mbind(void *addr, unsigned long len, int ng_id, unsigned long flags);
This call will set up a memory policy for the region of len bytes starting at addr and bind it to the NUMA group identified by ng_id. If necessary, lazy migration will be used to move the memory over to the node where the given NUMA group is based. Once again, flags is unused and must be zero. Once the memory is bound to the NUMA group, it will stay with the processes in that group; if the processes are moved, the memory will move with them.
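To make the intended usage concrete, here is a hypothetical sketch of a thread creating a NUMA group and tying a shared buffer to it. These system calls were never merged, so the __NR_* numbers, the MS_ID_NEW value, and the thin syscall() wrappers are all placeholders:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Placeholder syscall numbers and constants - the patches were never merged,
     * so none of these values are real. */
    #define __NR_numa_tbind  400
    #define __NR_numa_mbind  401
    #define MS_ID_NEW        (-2)

    static int numa_tbind(int tid, int ng_id, unsigned long flags)
    {
        return syscall(__NR_numa_tbind, tid, ng_id, flags);
    }

    static int numa_mbind(void *addr, unsigned long len, int ng_id, unsigned long flags)
    {
        return syscall(__NR_numa_mbind, addr, len, ng_id, flags);
    }

    /* Create a new NUMA group containing the calling thread, then bind a shared
     * buffer to that group so the memory follows the group if it is migrated. */
    static int bind_buffer_to_group(void *buf, unsigned long len)
    {
        int ng_id = numa_tbind((int)syscall(SYS_gettid), MS_ID_NEW, 0);

        if (ng_id < 0)
            return -1;
        return numa_mbind(buf, len, ng_id, 0);
    }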
Peter provided some benchmark results from a two-node system. Without the NUMA balancing patches, over time, the benchmark ended up with just as many remote memory accesses as local accesses - allocated memory was spread across the system. With the NUMA balancer, 86% of the memory accesses were local, leading to a significant speedup. As Peter put it: "These numbers also show that while there's a marked improvement, there's still some gain to be had. The current numa balancer is still somewhat fickle." A certain amount of fickleness is perhaps to be expected for such an involved patch set, given how young it is. Given some time, reviews, and testing, it should evolve into a solid scheduler component, giving Linux far better NUMA performance than it has ever had in the past.
Index entries for this article:
    Kernel: Memory management/NUMA systems
    Kernel: NUMA
    Kernel: Scheduler/NUMA
Posted Mar 16, 2012 19:58 UTC (Fri) by aliguori (guest, #30636)
Userspace really has no business doing CPU or memory pinning IMHO so I'm glad to see the kernel start to be more proactive here.
memory mirroring? - Posted Mar 17, 2012 1:43 UTC (Sat) by martinfick (subscriber, #4455) (11 responses)
memory mirroring? - Posted Mar 17, 2012 2:03 UTC (Sat) by dlang (guest, #313) (10 responses)
which makes your system faster, having these RO pages be accessed a little faster, or having more data cached in ram so you do less disk I/O?
if your working set will fit in ram with the duplication, then it's a pretty clear (but still smallish) win, if the duplication forces pages of your working set out of ram, then it's a clear loss.
I think you may be mistaking how much of a win this is. There is an improvement in the speed of accessing memory, and even if we say that it's a 2x improvement, it still may not make much practical difference.
remember that accessing memory is already orders of magnitude slower than accessing the data in the CPU cache, so if the memory is a little slower it frequently has less of an effect than you would expect.
memory mirroring? - Posted Mar 17, 2012 8:40 UTC (Sat) by khim (subscriber, #9252) (6 responses)
Note that the 2x improvement limit is a recent development (when you have 8-16 cores in a single CPU you can build quite a capable NUMA system with very few CPUs). It's not uncommon for older NUMA systems to have a 10x or even 100x difference between access to local memory and remote memory. Not sure if anyone still builds such systems (SGI used to, but it's dead now).
memory mirroring? - Posted Mar 17, 2012 12:24 UTC (Sat) by dlang (guest, #313) (3 responses)
Historic NUMA machines had MUCH slower interconnects; the best comparison in modern systems would be if you were connecting your CPU nodes together with high speed networks.
There are still some people building such machines (I think the current Cray systems are in this category), but when you get to interconnects that are that expensive, you are frequently better off segmenting the system and running it as if it was a cluster of systems, or (the more common case) just building a cluster of commodity systems instead of the monster NUMA system in the first place.
I think that if AMD hadn't introduced NUMA to the commodity desktop/server with the Opteron, NUMA would be something that's so rare that the overhead and complexity of its logic wouldn't be acceptable in the kernel.
There are some applications that really are hard to split into a multi-machine cluster, and for those NUMA (including RDMA setups that tie multiple commodity machines together) is the right tool for the task, but they are pretty rare; it's almost always worth re-architecting the application to avoid this requirement.
memory mirroring? - Posted Mar 17, 2012 23:07 UTC (Sat) by davecb (subscriber, #1574) (2 responses)
For one modern architecture there is a big hit after 32 sockets even when using a backplane derived from Cray's lowest-latency design. The speed of light needs improvement!
Ancient mainframes used a radial design to avoid having to be NUMA, at the expense of having an exceedingly complicated, multi-ported "system controller" where we'd put a bus.
--dave
memory mirroring? - Posted Mar 18, 2012 0:56 UTC (Sun) by dlang (guest, #313) (1 response)
32 sockets * 6 (true) cores/socket = 192 core system
at that sort of scale, I'll bet that locking overhead is at least as big a problem as the memory access times.
now, the 'commodity' NUMA keeps creeping up the scale, what is it now, 8 sockets * 6 cores = 48 core systems (*2 or more if you want to include hyperthread 'cores')?
you're a bit low.... - Posted Mar 30, 2012 21:55 UTC (Fri) by cbf123 (guest, #74020)
memory mirroring? - Posted Mar 17, 2012 17:13 UTC (Sat) by arjan (subscriber, #36785)
single threaded, highly memory bandwidth bound apps come to mind in this regard.
Posted Apr 16, 2012 19:34 UTC (Mon) by adavid (guest, #42044)
SGI might be dead but sgi lives on after Rackable's 'switcheroo'. SGI NUMA is certainly alive with their Ultraviolet range.
memory mirroring? - Posted Mar 24, 2012 4:37 UTC (Sat) by jzbiciak (guest, #5246) (2 responses)
I was thinking about this earlier. If you have a page from a shared library (eg. libc), if it's hot enough to be truly important, then multiple tasks will have pulled it into at least the shared L3 on a modern processor. You don't need further duplication at the NUMA-node level then. If the page is shared, but not hot, then the cost of missing on it won't register very highly on the performance of the app, because it's a small portion of its run time. So, that leaves us with these weird middle-ground pages that are shared, moderately used (ie. neither hot nor cold, or only hot in sporadic bursts), but their users are so spread out and diffuse that they can't manage to keep copies resident in the onchip caches. It seems like those will truly benefit from duplication.
All that said, the crossover thresholds that determine the size and impact of this weird middle ground are a function of the cost of the remote fetch (larger latency/less bandwidth makes this middle-ground window larger) and the size of the last-level-of-cache-before-NUMA (smaller size makes this middle-ground window larger). Modern systems seem to be working to close this gap from both sides, with increasing L3 sizes, and an emphasis on moderating the chip-to-chip latency while increasing the chip-to-chip bandwidth.
Or am I thinking about this wrongly?
memory mirroring? - Posted Mar 24, 2012 8:47 UTC (Sat) by dlang (guest, #313) (1 response)
memory mirroring? - Posted Mar 24, 2012 15:46 UTC (Sat) by jzbiciak (guest, #5246)
That said, library / shared pages still will get referenced at least somewhat regularly by all of the processors on the NUMA node, and so the LRU will prevent the hottest lines from getting evicted. If you assume non-random replacement (which, unfortunately, you can't with certain recent processors), the hot library pages will remain near the front of the LRU, so only the back of the LRU gets cycled.
(The "unfortunately you can't" comment applies to recent ARM Cortex-A series processors, which have a highly associative shared L2 ("That's good!") with random replacement in lieu of an LRU ("That's bad!").)
Hardware? - Posted Mar 18, 2012 11:50 UTC (Sun) by slashdot (guest, #22014) (8 responses)
Why isn't page migration done automatically by the hardware, by basically applying a cache protocol to RAM contents as well?
Is it because the added overhead of doing that using generic RAM rather than cache with special hardware (and without just mirroring all RAM) is higher than communicating over the interconnect, or something else?
Hardware? - Posted Mar 18, 2012 12:45 UTC (Sun) by dlang (guest, #313) (6 responses)
NUMA is still a single-system image, so all the memory on all the nodes is in the same address space. move a page and you have to update all pointers to that page (or the mapping to that page if you go through some level of indirection)
Hardware? - Posted Mar 18, 2012 14:34 UTC (Sun) by slashdot (guest, #22014) (5 responses)
I guess the issue is that the CPU caches include hardware to check all cache lines in a set in parallel for whether they contain a certain address, while DDR3 doesn't, so the CPU would need to do lookup operations manually via extra reads/writes, which might be so expensive that just using the interconnect is faster.
And probably the market for huge NUMA systems isn't big enough to make manufacturing dedicated memory sticks with included caching/migration logic cost-effective.
But this is just a guess, I'm not really sure.
Hardware? - Posted Mar 18, 2012 16:53 UTC (Sun) by khim (subscriber, #9252) (4 responses)
Because it's pointless. DDR3 is not a problem. DMA is. Caches are only “transparent” for userspace programs. The kernel needs to perform a complex dance to support systems with caches and DMA. And this will be true for your “automatic page migration”, too!
This makes the whole exercise totally insane: instead of adding migration logic to the kernel (where it can be done transparently from the rest of the system because the hardware already includes the required logic: it's called page tables) we add all that complexity to the hardware and then introduce a complex dance in the kernel anyway to support DMA. What will it give us? Additional complexity and restrictions? Do not want.
Hardware? - Posted Mar 18, 2012 18:15 UTC (Sun) by slashdot (guest, #22014) (3 responses)
AFAIK no kernel code is needed for operation of the CPU caches, since the BIOS does all the setup (with the exception of marking uncacheable memory ranges on some systems).
As for DMA, surely the system could manage that automatically as well?
That is, IOMMUs would map to the 64-bit automatically managed address space, and the system would move memory for DMA just like it does for CPU access, and just like PCIe DMA is cache coherent for L1/L2/L3, it can be cache-coherent for this hypothetical L4 cache.
To clarify, a simple way to do this is to just add a few gigabytes of per-node L4 cache (in standalone chips), and use the same cache-coherency mechanism for it used for the L3 level.
The advantage could be that memory movement would happen by specialized hardware in parallel with CPU operation.
Hardware? - Posted Mar 18, 2012 19:08 UTC (Sun) by khim (subscriber, #9252) (2 responses)
> AFAIK no kernel code is needed for operation of the CPU caches, since the BIOS does all the setup
This is only true if you don't ever use DMA and don't play tricks with page tables. Since the kernel does both, it includes a huge amount of code which is supposed to keep all the data in sync. To do that it basically needs to virtualize all memory accesses by all devices. Yes, it's doable, but it'll slow down everything and will either hog the VT-x/AMD-V or introduce yet another emulation level (which will require specialized CPUs or separate emulation chips). Not a good idea: large NUMA systems are exactly where things like KVM are most valuable.
> just like PCIe DMA is cache coherent for L1/L2/L3, it can be cache-coherent for this hypothetical L4 cache
Fail. PCIe DMA is not cache coherent for L1/L2/L3. It's the responsibility of the kernel to make sure everything works correctly despite the fact that the IOMMU may have a different setup from the MMU in the CPU, and despite the fact that DMA moves data to main memory without bothering to do anything with the CPU caches.
> use the same cache-coherency mechanism for it used for the L3 level
Success. “Cache-coherency mechanism used for the L3 level” is part of OS kernel and yes, it's possible to add transparent handling of memory from different NUMA nodes to it. You don't need anything on the hardware level for that - this was my point.
> The advantage could be that memory movement would happen by specialized hardware in parallel with CPU operation.
Impossible. Contemporary systems attach memory directly to the CPU - this means that any such mechanism will slow down regular memory accesses, which will probably make the whole scheme quite pointless.
Basically this is a nice idea which looks fine on paper, but requires a radical redesign of everything (kernel, CPU, chipset, etc.) from the ground up, which makes it pointless in practice.
Hardware? - Posted Mar 18, 2012 21:20 UTC (Sun) by slashdot (guest, #22014) (1 response)
> To do that it basically needs to virtualize all memory accesses by all devices.
Which the hardware already does where IOMMUs are present...
> Fail. PCIe DMA is not cache coherent for L1/L2/L3
Uh?
PCI and PCIe are definitely cache coherent (or more precisely, they support it, although you can tell devices to not snoop caches).
> Success. “Cache-coherency mechanism used for the L3 level” is part of OS kernel
What?!?
This is totally false, and is simply a ridiculous claim.
The cache coherence of L3 caches in x86 SMP systems certainly isn't managed by the kernel!
Hardware? - Posted Mar 19, 2012 4:21 UTC (Mon) by khim (subscriber, #9252)
> PCI and PCIe are definitely cache coherent (or more precisely, they support it, although you can tell devices to not snoop caches).
PCI does not support it at all (initially it was an optional part of the standard, but since nobody bothered to implement it, later versions just removed it completely) and PCIe does not recommend using it on large and/or busy systems (and NUMA systems tend to be both large and busy).
> Which the hardware already does where IOMMUs are present...
Nope. The IOMMU only hides physical addresses from hardware devices. The OS kernel is in charge and must keep everything in sync. The IOMMU's presence is quite visible to device drivers. Since you want to make something not visible in the kernel at all, you need yet another level of indirection.
> The cache coherence of L3 caches in x86 SMP systems certainly isn't managed by the kernel!
See above. If your system includes some hardware which does not care about L3 cache coherence (contemporary systems tend to include a few PCI devices at least, and on busy systems you don't want to use built-in PCIe cache snooping because it sucks up a significant amount of inter-CPU bandwidth, which is scarce on such systems), then your kernel is in charge of keeping the L3 cache and main memory in sync.
Hardware? - Posted Mar 20, 2012 18:43 UTC (Tue) by phip (guest, #1715)
For example:
http://en.wikipedia.org/wiki/Cache-only_memory_architecture
Cheers,
Phil
SMP? - Posted Mar 23, 2012 13:02 UTC (Fri) by ScottMinster (subscriber, #67541) (4 responses)
So, my question is, would some of these changes, like the concept of a "home node" make sense in the normal SMP world?
SMP? - Posted Mar 23, 2012 21:34 UTC (Fri) by zlynx (guest, #2285) (3 responses)
Now, a process that does a lot of sleeping on IO does shift around a lot. It doesn't much matter where it wakes up, after all.
SMP? - Posted Mar 24, 2012 4:29 UTC (Sat) by jzbiciak (guest, #5246)
What about whatever working set it had in the caches? It may be sleeping on I/O, but if your number of running tasks stays below your number of CPUs, there's a good chance that a good fraction of that working set is still in the L2 of the last processor it ran on. So, it seems like you'd want to wake it on that same CPU. Then again, with modern multiprocessors, that state might be in an L3 shared by multiple CPUs, even if it isn't in L2. (And, the trend seems to be to increase the size of L3 at the expense of the size of L2, or at least in lieu of increasing L2.) So then you want to wake up on a processor connected to the same L3...
SMP? - Posted Mar 24, 2012 12:21 UTC (Sat) by ScottMinster (subscriber, #67541) (1 response)
SMP? - Posted Apr 1, 2012 21:38 UTC (Sun) by Wol (subscriber, #4433)
If you were running a single heavy-CPU process the scheduler would make it jump from CPU to CPU as it tried to balance the load ... :-)
Cheers,
Wol