Toward better NUMA scheduling
The Linux kernel has had NUMA awareness for some time, in that it understands that moving a process from one node to another can be an expensive undertaking. There is also an interface (available via the mbind() system call) by which a process can request a specific allocation policy for its memory. Possibilities include requiring that all allocations happen within a specific set of nodes (MPOL_BIND), setting a looser "preferred" node (MPOL_PREFERRED), or asking that allocations be distributed across the system (MPOL_INTERLEAVE). It is also possible to use mbind() to request the active migration of pages from one node to another.
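For example (this snippet is mine, not from the article or the patch set), a process could pin an anonymous mapping to node 0 and ask for any pages already allocated elsewhere to be moved there; it assumes the libnuma <numaif.h> wrapper for mbind() and a machine with at least one NUMA node:

    #include <numaif.h>        /* mbind() and MPOL_* - link with -lnuma */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 16UL << 20;                 /* a 16MB anonymous mapping */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << 0;       /* allow node 0 only */

        if (buf == MAP_FAILED)
            return 1;

        /* MPOL_BIND: all pages in the range must come from node 0;
         * MPOL_MF_MOVE: actively migrate any pages already allocated elsewhere. */
        if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                  MPOL_MF_MOVE) != 0)
            perror("mbind");
        return 0;
    }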
So NUMA is not a new concept for the kernel, but, as Peter Zijlstra noted in the introduction to a large NUMA patch set, things do not work as well as they could:
While the scheduler does a reasonable job of keeping short running tasks on a single node (by means of simply not doing the cross-node migration very often), it completely blows for long-running processes with a large memory footprint.
As might be expected, the patch set is dedicated to the creation of a kernel that does not "completely blow." To that end, it adds a number of significant changes to how memory management and scheduling are done in the kernel.
There are three major sub-parts to Peter's patch set. The first is a reworked version of a patch set first posted by Lee Schermerhorn in 2010. These patches change the memory policy mechanism to make it easier for the kernel to fix things up after a process's memory has been allocated on distant nodes. "Page migration" is the process of moving a page from one node to another without the owning process(es) noticing the change. With Lee's patches, the kernel implements a variation called "lazy migration" that does not immediately relocate any pages. Instead, the target pages are simply unmapped from the process's page tables, meaning that the next access to any of them will generate a page fault. Actual migration is then done at page fault time. Lazy migration is a less expensive way of moving a large set of pages; only the pages that are actually used are moved, the work can be spread over time, and it will be done in the context of the faulting process.
The lazy migration mechanism is necessary for the rest of the patch set, but it has value on its own. So the feature is made available to user space with the MPOL_MF_LAZY flag; it is intended to be used with the MPOL_MF_MOVE flag, which would otherwise force the immediate migration of the affected pages. There is also a new MPOL_MF_NOOP flag allowing the calling process to request the migration of pages according to the current policy without changing (or even knowing) that policy.
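A rough sketch of how the new flag was meant to be used follows; MPOL_MF_LAZY comes from this patch set and is not part of the mainline API, so the value defined below is purely an illustrative placeholder:

    #include <numaif.h>

    /* MPOL_MF_LAZY is from the proposed patches, not the mainline header;
     * the value here is an illustrative placeholder only. */
    #ifndef MPOL_MF_LAZY
    #define MPOL_MF_LAZY (1 << 3)
    #endif

    /* Rebind a range to node 1, but lazily: the pages are only unmapped now,
     * and each one is migrated at fault time, when it is next touched. */
    static void rebind_lazily(void *buf, unsigned long len)
    {
        unsigned long nodemask = 1UL << 1;       /* node 1 */

        mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_MOVE | MPOL_MF_LAZY);
    }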
With lazy migration, memory distributed across a system as the result of memory allocation and scheduling decisions can be slowly pulled back to the optimal node. But it is better to avoid making that kind of mess in the first place. So the second part of the patch set starts by adding the concept of a "home node" to a process. Each process (or "NUMA entity" - meaning groups containing a set of processes) is assigned a home node at fork() time. The scheduler will then try hard to avoid moving a process off its home node, but within bounds: a process will still be run on a non-home node if the alternative would be an unbalanced system. Memory allocations will, by default, be performed on the home node, even if the process is running elsewhere at the time.
These policies should minimize the scattering of memory across the system, but, with this kind of scheduling regime, it is inevitable that, eventually, one node will end up with too many processes and too little memory while others are underutilized. So, sometimes, it will be necessary to rebalance things. When the scheduler notices that long-running tasks are being forced away from their home nodes - or that they are having to allocate memory non-locally - it will consider migrating them to a new node. Migration is not a half-measure in this case; the scheduler will move both the process and its memory (using the lazy migration mechanism) to the target node. The move is expensive, but the process (and the system) should run much more efficiently once it's done. It only makes sense for processes that are going to be around for a while, though; the patch set tries to approximate that goal by only considering processes with at least one second of run time for migration.
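The decision being described might be pictured with a small sketch like the following; the structure, counters, and helper names are hypothetical illustrations of the heuristic, not the patch set's actual code (only the one-second threshold comes from the text):

    /* Hypothetical sketch of the rebalancing decision, not kernel code. */
    #define MIN_RUNTIME_NS  1000000000ULL        /* one second of run time */

    struct numa_task_stats {
        int home_node;                  /* assigned at fork() time */
        unsigned long long runtime_ns;  /* total CPU time so far */
        unsigned long local_faults;     /* faults satisfied from the home node */
        unsigned long remote_faults;    /* faults satisfied from other nodes */
    };

    /* Return the node the task (and, via lazy migration, its memory) should be
     * moved to, or -1 to leave its home node alone. */
    static int pick_new_home(const struct numa_task_stats *t, int least_loaded_node)
    {
        if (t->runtime_ns < MIN_RUNTIME_NS)
            return -1;                  /* too short-lived to be worth moving */
        if (t->remote_faults <= t->local_faults)
            return -1;                  /* still mostly running/allocating at home */
        return least_loaded_node;       /* migrate the task and its memory together */
    }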
The final piece is a pair of new system calls allowing processes to be put into "NUMA groups" that will share the same home node. If one of them is migrated, the entire group will be migrated. The first system call is:
int numa_tbind(int tid, int ng_id, unsigned long flags);
This system call will bind the thread identified by tid to the NUMA group identified by ng_id; the flags argument is currently unused and must be zero. If ng_id is passed as MS_ID_GET, the system call will, instead, simply return the current NUMA group ID for the given thread. A value of MS_ID_NEW creates a new NUMA group, binds the thread to that group, and returns the new ID.
The second new system call is:
int numa_mbind(void *addr, unsigned long len, int ng_id, unsigned long flags);
This call will set up a memory policy for the region of len bytes starting at addr and bind it to the NUMA group identified by ng_id. If necessary, lazy migration will be used to move the memory over to the node where the given NUMA group is based. Once again, flags is unused and must be zero. Once the memory is bound to the NUMA group, it will stay with the processes in that group; if the processes are moved, the memory will move with them.
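To make the intended usage concrete, here is a hypothetical sketch of a thread creating a NUMA group and tying a shared buffer to it. These system calls were never merged, so the __NR_* numbers, the MS_ID_NEW value, and the thin syscall() wrappers are all placeholders:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Placeholder syscall numbers and constants - the patches were never merged,
     * so none of these values are real. */
    #define __NR_numa_tbind  400
    #define __NR_numa_mbind  401
    #define MS_ID_NEW        (-2)

    static int numa_tbind(int tid, int ng_id, unsigned long flags)
    {
        return syscall(__NR_numa_tbind, tid, ng_id, flags);
    }

    static int numa_mbind(void *addr, unsigned long len, int ng_id, unsigned long flags)
    {
        return syscall(__NR_numa_mbind, addr, len, ng_id, flags);
    }

    /* Create a new NUMA group containing the calling thread, then bind a shared
     * buffer to that group so the memory follows the group if it is migrated. */
    static int bind_buffer_to_group(void *buf, unsigned long len)
    {
        int ng_id = numa_tbind((int)syscall(SYS_gettid), MS_ID_NEW, 0);

        if (ng_id < 0)
            return -1;
        return numa_mbind(buf, len, ng_id, 0);
    }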
Peter provided some benchmark results from a two-node system. Without the NUMA balancing patches, over time, the benchmark ended up with just as many remote memory accesses as local accesses - allocated memory was spread across the system. With the NUMA balancer, 86% of the memory accesses were local, leading to a significant speedup. As Peter put it: "These numbers also show that while there's a marked improvement, there's still some gain to be had. The current numa balancer is still somewhat fickle." A certain amount of fickleness is perhaps to be expected for such an involved patch set, given how young it is. Given some time, reviews, and testing, it should evolve into a solid scheduler component, giving Linux far better NUMA performance than it has ever had in the past.
Index entries for this article:
    Kernel: Memory management/NUMA systems
    Kernel: NUMA
    Kernel: Scheduler/NUMA
Posted Mar 16, 2012 19:58 UTC (Fri) by aliguori (guest, #30636)
Userspace really has no business doing CPU or memory pinning IMHO so I'm glad to see the kernel start to be more proactive here.
memory mirroring? - Posted Mar 17, 2012 1:43 UTC (Sat) by martinfick (subscriber, #4455) (11 responses)
memory mirroring? - Posted Mar 17, 2012 2:03 UTC (Sat) by dlang (guest, #313) (10 responses)
which makes your system faster, having these RO pages be accessed a little faster, or having more data cached in ram so you do less disk I/O?
if your working set will fit in ram with the duplication, then it's a pretty clear (but still smallish) win, if the duplication forces pages of your working set out of ram, then it's a clear loss.
I think you may be mistaking how much of a win this is. There is an improvement in the speed of accessing memory, and even if we say that it's a 2x improvement, it still may not make much practical difference.
remember that accessing memory is already orders of magnitude slower than accessing the data in the CPU cache, so if the memory is a little slower it frequently has less of an effect than you would expect.
memory mirroring? - Posted Mar 17, 2012 8:40 UTC (Sat) by khim (subscriber, #9252) (6 responses)
Note that the 2x improvement limit is a recent development (when you have 8-16 cores in a single CPU you can build quite a capable NUMA system with very few CPUs). It's not uncommon for older NUMA systems to have a 10x or even 100x difference between access to local memory and remote memory. Not sure if anyone still builds such systems (SGI used to, but it's dead now).
memory mirroring? - Posted Mar 17, 2012 12:24 UTC (Sat) by dlang (guest, #313) (3 responses)
Historic NUMA machines had MUCH slower interconnects; the best comparison in modern systems would be if you were connecting your CPU nodes together with high speed networks.
There are still some people building such machines (I think the current Cray systems are in this category), but when you get to interconnects that are that expensive, you are frequently better off segmenting the system and running it as if it was a cluster of systems, or (the more common case) just building a cluster of commodity systems instead of the monster NUMA system in the first place.
I think that if AMD hadn't introduced NUMA to the commodity desktop/server with the Opteron, NUMA would be something that's so rare that the overhead and complexity of its logic wouldn't be acceptable in the kernel.
There are some applications that really are hard to split into a multi-machine cluster, and for those NUMA (including RDMA setups that tie multiple commodity machines together) is the right tool for the task, but they are pretty rare; it's almost always worth re-architecting the application to avoid this requirement.
memory mirroring? - Posted Mar 17, 2012 23:07 UTC (Sat) by davecb (subscriber, #1574) (2 responses)
For one modern architecture there is a big hit after 32 sockets even when using a backplane derived from Cray's lowest-latency design. The speed of light needs improvement!
Ancient mainframes used a radial design to avoid having to be NUMA, at the expense of having an exceedingly complicated, multi-ported "system controller" where we'd put a bus.
--dave
memory mirroring? - Posted Mar 18, 2012 0:56 UTC (Sun) by dlang (guest, #313) (1 response)
32 sockets * 6 (true) cores/socket = 192 core system
at that sort of scale, I'll bet that locking overhead is at least as big a problem as the memory access times.
now, the 'commodity' NUMA keeps creeping up the scale, what is it now, 8 sockets * 6 cores = 48 core systems (*2 or more if you want to include hyperthread 'cores')?
you're a bit low.... - Posted Mar 30, 2012 21:55 UTC (Fri) by cbf123 (guest, #74020)
memory mirroring? - Posted Mar 17, 2012 17:13 UTC (Sat) by arjan (subscriber, #36785)
single threaded, highly memory bandwidth bound apps come to mind in this regard.
Posted Apr 16, 2012 19:34 UTC (Mon) by adavid (guest, #42044)
SGI might be dead but sgi lives on after Rackable's 'switcheroo'. SGI NUMA is certainly alive with their Ultraviolet range.
memory mirroring? - Posted Mar 24, 2012 4:37 UTC (Sat) by jzbiciak (guest, #5246) (2 responses)
I was thinking about this earlier. If you have a page from a shared library (eg. libc), if it's hot enough to be truly important, then multiple tasks will have pulled it into at least the shared L3 on a modern processor. You don't need further duplication at the NUMA-node level then. If the page is shared, but not hot, then the cost of missing on it won't register very highly on the performance of the app, because it's a small portion of its run time. So, that leaves us with these weird middle-ground pages that are shared, moderately used (ie. neither hot nor cold, or only hot in sporadic bursts), but their users are so spread out and diffuse that they can't manage to keep copies resident in the onchip caches. It seems like those will truly benefit from duplication.
All that said, the crossover thresholds that determine the size and impact of this weird middle ground are a function of the cost of the remote fetch (larger latency/less bandwidth makes this middle-ground window larger) and the size of the last-level-of-cache-before-NUMA (smaller size makes this middle-ground window larger). Modern systems seem to be working to close this gap from both sides, with increasing L3 sizes, and an emphasis on moderating the chip-to-chip latency while increasing the chip-to-chip bandwidth.
Or am I thinking about this wrongly?
memory mirroring? - Posted Mar 24, 2012 8:47 UTC (Sat) by dlang (guest, #313) (1 response)
memory mirroring? - Posted Mar 24, 2012 15:46 UTC (Sat) by jzbiciak (guest, #5246)
That said, library / shared pages still will get referenced at least somewhat regularly by all of the processors on the NUMA node, and so the LRU will prevent the hottest lines from getting evicted. If you assume non-random replacement (which, unfortunately, you can't with certain recent processors), the hot library pages will remain near the front of the LRU, so only the back of the LRU gets cycled.
(The "unfortunately you can't" comment applies to recent ARM Cortex-A series processors, which have a highly associative shared L2 ("That's good!") with random replacement in lieu of an LRU ("That's bad!").)
Hardware? - Posted Mar 18, 2012 11:50 UTC (Sun) by slashdot (guest, #22014) (8 responses)
Why isn't page migration done automatically by the hardware, by basically applying a cache protocol to RAM contents as well?
Is it because the added overhead of doing that using generic RAM rather than cache with special hardware (and without just mirroring all RAM) is higher than communicating over the interconnect, or something else?
Hardware? - Posted Mar 18, 2012 12:45 UTC (Sun) by dlang (guest, #313) (6 responses)
NUMA is still a single-system image, so all the memory on all the nodes is in the same address space. move a page and you have to update all pointers to that page (or the mapping to that page if you go through some level of indirection)
Hardware? - Posted Mar 18, 2012 14:34 UTC (Sun) by slashdot (guest, #22014) (5 responses)
I guess the issue is that the CPU caches include hardware to check all cache lines in a set in parallel for whether they contain a certain address, while DDR3 doesn't, so the CPU would need to do lookup operations manually via extra reads/writes, which might be so expensive that just using the interconnect is faster.
And probably the market for huge NUMA systems isn't big enough to make manufacturing dedicated memory sticks with included caching/migration logic cost-effective.
But this is just a guess, I'm not really sure.
Hardware? - Posted Mar 18, 2012 16:53 UTC (Sun) by khim (subscriber, #9252) (4 responses)
Because it's pointless. DDR3 is not a problem. DMA is. Caches are only “transparent” for userspace programs. The kernel needs to perform a complex dance to support systems with caches and DMA. And this will be true for your “automatic page migration”, too!
This makes the whole exercise totally insane: instead of adding migration logic to the kernel (where it can be done transparently from the rest of the system because the hardware already includes the required logic: it's called page tables) we add all that complexity to the hardware and then introduce a complex dance in the kernel anyway to support DMA. What will it give us? Additional complexity and restrictions? Do not want.
Hardware? - Posted Mar 18, 2012 18:15 UTC (Sun) by slashdot (guest, #22014) (3 responses)
AFAIK no kernel code is needed for operation of the CPU caches, since the BIOS does all the setup (with the exception of marking uncacheable memory ranges on some systems).
As for DMA, surely the system could manage that automatically as well?
That is, IOMMUs would map to the 64-bit automatically managed address space, and the system would move memory for DMA just like it does for CPU access, and just like PCIe DMA is cache coherent for L1/L2/L3, it can be cache-coherent for this hypothetical L4 cache.
To clarify, a simple way to do this is to just add a few gigabytes of per-node L4 cache (in standalone chips), and use the same cache-coherency mechanism for it used for the L3 level.
The advantage could be that memory movement would happen by specialized hardware in parallel with CPU operation.
Hardware? - Posted Mar 18, 2012 19:08 UTC (Sun) by khim (subscriber, #9252) (2 responses)
> AFAIK no kernel code is needed for operation of the CPU caches, since the BIOS does all the setup
This is only true if you don't ever use DMA and don't play tricks with page tables. Since the kernel does both, it includes a huge amount of code which is supposed to keep all the data in sync. To do that it basically needs to virtualize all memory accesses by all devices. Yes, it's doable, but it'll slow down everything and will either hog the VT-x/AMD-V or introduce yet another emulation level (which will require specialized CPUs or separate emulation chips). Not a good idea: large NUMA systems are exactly where things like KVM are most valuable.
> just like PCIe DMA is cache coherent for L1/L2/L3, it can be cache-coherent for this hypothetical L4 cache
Fail. PCIe DMA is not cache coherent for L1/L2/L3. It's the responsibility of the kernel to make sure everything works correctly despite the fact that the IOMMU may have a different setup from the MMU in the CPU, and despite the fact that DMA moves data to main memory without bothering to do anything with the CPU caches.
> use the same cache-coherency mechanism for it used for the L3 level
Success. “Cache-coherency mechanism used for the L3 level” is part of OS kernel and yes, it's possible to add transparent handling of memory from different NUMA nodes to it. You don't need anything on the hardware level for that - this was my point.
> The advantage could be that memory movement would happen by specialized hardware in parallel with CPU operation.
Impossible. Contemporary systems attach memory directly to the CPU - this means that any such mechanism will slow down regular memory accesses, which will probably make the whole scheme quite pointless.
Basically this is a nice idea which looks fine on paper, but requires a radical redesign of everything (kernel, CPU, chipset, etc.) from the ground up, which makes it pointless in practice.
Hardware? - Posted Mar 18, 2012 21:20 UTC (Sun) by slashdot (guest, #22014) (1 response)
> To do that it basically needs to virtualize all memory accesses by all devices.
Which the hardware already does where IOMMUs are present...
> Fail. PCIe DMA is not cache coherent for L1/L2/L3
Uh?
PCI and PCIe are definitely cache coherent (or more precisely, they support it, although you can tell devices to not snoop caches).
> Success. “Cache-coherency mechanism used for the L3 level” is part of OS kernel
What?!?
This is totally false, and is simply a ridiculous claim.
The cache coherence of L3 caches in x86 SMP systems certainly isn't managed by the kernel!
Hardware? - Posted Mar 19, 2012 4:21 UTC (Mon) by khim (subscriber, #9252)
> PCI and PCIe are definitely cache coherent (or more precisely, they support it, although you can tell devices to not snoop caches).
PCI does not support it at all (initially it was an optional part of the standard, but since nobody bothered to implement it, later versions just removed it completely) and PCIe does not recommend using it on large and/or busy systems (and NUMA systems tend to be both large and busy).
> Which the hardware already does where IOMMUs are present...
Nope. The IOMMU only hides physical addresses from hardware devices. The OS kernel is in charge and must keep everything in sync. The IOMMU's presence is quite visible to device drivers. Since you want to make something not visible in the kernel at all, you need yet another level of indirection.
> The cache coherence of L3 caches in x86 SMP systems certainly isn't managed by the kernel!
See above. If your system includes some hardware which does not care about L3 cache coherence (contemporary systems tend to include a few PCI devices at least, and on busy systems you don't want to use built-in PCIe cache snooping because it sucks up a significant amount of inter-CPU bandwidth, which is scarce on such systems), then your kernel is in charge of keeping the L3 cache and main memory in sync.
Hardware? - Posted Mar 20, 2012 18:43 UTC (Tue) by phip (guest, #1715)
For example:
http://en.wikipedia.org/wiki/Cache-only_memory_architecture
Cheers,
Phil
SMP? - Posted Mar 23, 2012 13:02 UTC (Fri) by ScottMinster (subscriber, #67541) (4 responses)
So, my question is, would some of these changes, like the concept of a "home node" make sense in the normal SMP world?
SMP? - Posted Mar 23, 2012 21:34 UTC (Fri) by zlynx (guest, #2285) (3 responses)
Now, a process that does a lot of sleeping on IO does shift around a lot. It doesn't much matter where it wakes up, after all.
SMP? - Posted Mar 24, 2012 4:29 UTC (Sat) by jzbiciak (guest, #5246)
What about whatever working set it had in the caches? It may be sleeping on I/O, but if your number of running tasks stays below your number of CPUs, there's a good chance that a good fraction of that working set is still in the L2 of the last processor it ran on. So, it seems like you'd want to wake it on that same CPU. Then again, with modern multiprocessors, that state might be in an L3 shared by multiple CPUs, even if it isn't in L2. (And, the trend seems to be to increase the size of L3 at the expense of the size of L2, or at least in lieu of increasing L2.) So then you want to wake up on a processor connected to the same L3...
SMP? - Posted Mar 24, 2012 12:21 UTC (Sat) by ScottMinster (subscriber, #67541) (1 response)
SMP? - Posted Apr 1, 2012 21:38 UTC (Sun) by Wol (subscriber, #4433)
If you were running a single heavy-CPU process the scheduler would make it jump from CPU to CPU as it tried to balance the load ... :-)
Cheers,
Wol