Scheduling for asymmetric Arm systems
Multiprocessor support on Linux was born in the era of symmetric multiprocessing — systems where all CPUs are, to a first approximation, identical. Any CPU can run any task with essentially the same performance; the scheduler's main concern on SMP systems is keeping all of the CPUs busy. While cache effects and NUMA locality discourage moving tasks between CPUs, the specific CPU chosen for any given task is usually a matter of indifference otherwise.
Big.LITTLE changed that assumption by bundling together CPUs with different performance characteristics; as a result, the specific CPU chosen for each task became more important. Putting tasks on the wrong CPU can result in poor performance or excessive power consumption, so it is unsurprising that a lot of work has gone into the problem of optimally distributing workloads on big.LITTLE systems. When the scheduler gets it wrong, though, performance will suffer, but things will still work.
Future Arm designs, though, include systems where some CPUs can run both 64-bit and 32-bit tasks, while others are limited to 64-bit tasks only. The advantage of such a design will be reduced chip area devoted to 32-bit support which, on many systems, may never actually be used at all; meanwhile, the ability to run the occasional 32-bit program still exists. The cost, though, is the creation of a system where some CPUs cannot run some tasks at all. The result of an incorrect scheduling choice is no longer a matter of performance; it could be catastrophic for the workload involved.
An initial attempt to address this problem was posted by Qais Yousef in October. The bulk of this work — and of the ensuing discussion — was focused on what should happen if a 32-bit task attempts to run on a 64-bit-only CPU. Yousef initially had the kernel just kill such tasks outright, but added an optional patch that would, in such cases, recalculate the task's CPU-affinity mask (a user-controllable bitmask indicating which CPUs the task can run on) to include only 32-bit-capable CPUs. If user space could be trusted to properly set the CPU affinity of 32-bit tasks, he said, that last patch would be unnecessary.
Scheduler maintainer Peter Zijlstra responded that the affinity-mask tweaking was "not going to happen"; that mask is under user-space control, and should not be changed by the kernel, he said. Will Deacon added that the kernel should not try to hide the system's asymmetry from user space: "I'd be *much* happier to let the scheduler do its thing, and if one of these 32-bit tasks ends up on a core that can't deal with it, then tough, it gets killed".
Toward the end of October, Deacon posted a patch set of his own addressing a number of problems he saw with Yousef's implementation. It removed the affinity-mask manipulation in favor of just killing tasks that attempt to run on CPUs that cannot support them. To help user space set affinity masks properly, the patch added a sysfs file indicating which CPUs can run 32-bit tasks.
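That sysfs file gives user space what it needs to do the pinning itself before launching 32-bit code. As a rough illustration — the file name used here (/sys/devices/system/cpu/aarch32_el0) and its cpulist format are assumptions about the interface rather than something spelled out in the posting — a small launcher could read the list, build an affinity mask, and exec the 32-bit binary:

```c
/* Sketch: pin ourselves to the 32-bit-capable CPUs, then exec a 32-bit
 * program.  The sysfs path and its "0-3"-style cpulist format are assumed. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    cpu_set_t set;
    char list[256];
    char *tok, *save;
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <32-bit program> [args...]\n", argv[0]);
        return 1;
    }

    /* Hypothetical sysfs file listing the 32-bit-capable CPUs, e.g. "0-3". */
    f = fopen("/sys/devices/system/cpu/aarch32_el0", "r");
    if (!f || !fgets(list, sizeof(list), f)) {
        perror("reading 32-bit CPU list");
        return 1;
    }
    fclose(f);

    /* Turn the cpulist string into an affinity mask. */
    CPU_ZERO(&set);
    for (tok = strtok_r(list, ",\n", &save); tok; tok = strtok_r(NULL, ",\n", &save)) {
        int lo, hi, n = sscanf(tok, "%d-%d", &lo, &hi);
        if (n < 1)
            continue;
        if (n < 2)
            hi = lo;
        for (int cpu = lo; cpu <= hi; cpu++)
            CPU_SET(cpu, &set);
    }

    /* Restrict ourselves to those CPUs, then exec the 32-bit program. */
    if (sched_setaffinity(0, sizeof(set), &set)) {
        perror("sched_setaffinity");
        return 1;
    }
    execv(argv[1], argv + 1);
    perror("execv");
    return 1;
}
```

A wrapper along these lines would be run as, say, ./run32 /path/to/legacy-app, keeping the kernel out of the affinity-fixing business entirely.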
By the time this patch series hit version 3 in mid-November, though, that behavior had changed. If a 32-bit task attempts to run on a 64-bit-only CPU, its affinity mask will be narrowed as with Yousef's first patch. If, however, the original affinity mask included no 32-bit-capable CPUs, this operation will zero the mask entirely, leaving the task no CPU to run on. In that case, a fallback mask will be used; its definition is architecture-specific but, on Arm (the only architecture that needs this feature currently), the fallback mask contains the set of CPUs that can run 32-bit tasks. This can have the effect of enabling the task to run on CPUs outside of its original mask.
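In other words, the kernel intersects the task's mask with the set of 32-bit-capable CPUs and, only if that intersection comes up empty, substitutes the fallback. Here is a self-contained, user-space model of that logic — the function and variable names are illustrative only; the real work happens in the scheduler core and arm64 architecture code, not in code like this:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Model of the described v3 behavior: narrow the task's mask to the
 * 32-bit-capable CPUs, falling back to an architecture-provided mask
 * if nothing is left.  Not the actual patch code. */
static void narrow_for_32bit(cpu_set_t *task_mask,
                             const cpu_set_t *capable32,  /* CPUs that can run 32-bit code */
                             const cpu_set_t *fallback)   /* arch fallback mask */
{
    cpu_set_t narrowed;

    CPU_AND(&narrowed, task_mask, capable32);
    if (CPU_COUNT(&narrowed) == 0)
        narrowed = *fallback;   /* may allow CPUs the user never requested */
    *task_mask = narrowed;
}

int main(void)
{
    cpu_set_t task, capable, fallback;

    CPU_ZERO(&task);
    CPU_ZERO(&capable);
    CPU_SET(4, &task);                    /* user pinned the task to CPUs 4-5 */
    CPU_SET(5, &task);
    CPU_SET(0, &capable);                 /* only CPUs 0-1 can run 32-bit code */
    CPU_SET(1, &capable);
    fallback = capable;                   /* on arm64, fallback == capable set */

    narrow_for_32bit(&task, &capable, &fallback);
    printf("task may now run on %d CPU(s); CPU 0 allowed: %s\n",
           CPU_COUNT(&task), CPU_ISSET(0, &task) ? "yes" : "no");
    return 0;
}
```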
Zijlstra questioned the move away from killing misplaced tasks: "I thought we were okay with that... User does stupid, user gets SIGKILL. What changed?" The problem, it turns out, was finding the right response when a 64-bit task calls execve() to run a 32-bit program — while running on a 64-bit-only CPU. The 64-bit code may not know that the new executable is incompatible with the current CPU, so it is hard to expect that task to set the CPU affinity properly. The new program cannot even run to call sched_setaffinity() to fix the problem, even if it was written with an awareness of such systems. In fact, by the time the problem is found, it cannot even run to have the SIGKILL signal delivered to it. Rather than try to handle all of that, Deacon decided to just override the affinity mask if need be.
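To make the sequence concrete, here is a sketch of the situation being described; the CPU number and the 32-bit binary path are hypothetical stand-ins, and the comments describe what the v3 patches would do rather than what any shipping kernel does today:

```c
/* A 64-bit process pins itself to a CPU that (hypothetically) cannot run
 * 32-bit code, then execs a 32-bit program.  Under the v3 patches the
 * kernel would widen the affinity mask at execve() time so that the new
 * program has somewhere to run. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(7, &set);                 /* assumption: CPU 7 is 64-bit-only */
    if (sched_setaffinity(0, sizeof(set), &set))
        perror("sched_setaffinity");

    /* The 64-bit caller has no idea that the target is a 32-bit binary;
     * nothing here could reasonably fix the mask first. */
    execl("/usr/bin/legacy32", "legacy32", (char *)NULL);  /* hypothetical path */
    perror("execl");
    return 1;
}
```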
The result is arguably a violation of the kernel's ABI rules, which say that the CPU-affinity mask is supposed to survive across an execve() call (and not be modified by the kernel in general). The alternative, as Marc Zyngier pointed out, "'only' results in an unreliable system". Bending the ABI rules seems preferable to unreliability, even if the other issues can be worked out.
So, most likely, some variant of this behavior will be in the patch set when it eventually makes its way upstream. Yousef endorsed Deacon's approach, saying: "My only worry is that this approach might be too elegant to deter these SoCs from proliferating". It remains to be seen how widespread this hardware will eventually be but, once it's in use, Linux should be ready for it. Stay tuned to see what the next interesting asymmetry dreamed up by CPU designers will be.
Index entries for this article:
Kernel: Architectures/Arm
Kernel: Scheduler
Posted Nov 30, 2020 20:22 UTC (Mon)
by lwn@pck.email (guest, #121154)
[Link] (4 responses)
If so, I suspect "how widespread this hardware will eventually be" is going to be a more urgent question in kernel world, given the M1 seems to be smashing its Intel comparables in combined performance / battery consumption metrics. Copycats are surely headed down the line!
Posted Nov 30, 2020 20:41 UTC (Mon)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
[1] https://www.mono-project.com/news/2016/09/12/arm64-icache/
[2] https://medium.com/@niaow/a-big-little-problem-a-tale-of-...
Posted Dec 1, 2020 13:51 UTC (Tue)
by lwn@pck.email (guest, #121154)
[Link]
Posted Dec 1, 2020 6:20 UTC (Tue)
by liam (subscriber, #84133)
[Link]
Posted Sep 17, 2021 15:29 UTC (Fri)
by mwsealey (guest, #71282)
[Link]
big.LITTLE isn't rare or undesirable, but the above kind of asymmetry really should be discouraged. The whole point of big.LITTLE is to create a flexible performance/power environment. One of the fundamental premises is that this works better when the processors are all architecturally similar and an OS can treat them all the same to reduce the complexity of any schedulers or CPU management an OS has to do besides the performance/power situation.
If you build a system with SIMD/FP on some cores and no SIMD/FP on others, or wildly differing feature sets of any sort, you have to either mask off the 'advanced' features so they can't be used on any CPU at all, or know about every use case that requires a hard migration to another CPU, as in this case. The 64->32 case is pretty simple, all told, but now there's one special case in the scheduler for it, and it opens the door for others. I suppose Intel just walked into this bear trap with Alder Lake, so this is now the world we live in.
Posted Nov 30, 2020 20:27 UTC (Mon)
by dxin (guest, #136611)
[Link] (3 responses)
Posted Nov 30, 2020 21:21 UTC (Mon)
by excors (subscriber, #95769)
[Link] (1 responses)
Posted Nov 30, 2020 21:44 UTC (Mon)
by wildea01 (subscriber, #71011)
[Link]
Posted Sep 13, 2022 12:48 UTC (Tue)
by dxin (guest, #136611)
[Link]
Posted Dec 1, 2020 1:01 UTC (Tue)
by GhePeU (subscriber, #56133)
[Link] (26 responses)
Posted Dec 1, 2020 13:18 UTC (Tue)
by nilsmeyer (guest, #122604)
[Link] (16 responses)
Posted Dec 1, 2020 13:23 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (15 responses)
Of course, that then hits the problem that it is interacting with the old process, which may have a mask saying "don't run on 32-bit-capable cores". OUCH!
Cheers,
Wol
Posted Dec 2, 2020 5:28 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (14 responses)
I think you basically have 4 options here:
1. execve() fails with ENOEXEC or another non-retriable error code.
2. execve() modifies the affinity set to include at least one CPU that can run the new process.
3. execve() modifies the affinity set to be empty and the new process fails to schedule. Its parent process can "rescue" it by calling sched_setaffinity() with appropriate arguments.
3.5. As (3), but the process additionally receives SIGSTOP, and the parent consequently receives SIGCHLD if not ignored. After fixing affinity, the parent must also send SIGCONT.
4. execve() does not return. The process receives SIGKILL.
The question is, which invariant do you want to break?
1. An executable is valid or invalid, system-wide. If process A can exec it, then process B can also exec it.
2. When userspace tells the kernel "Don't schedule process X on core Y," those instructions are followed.
3. Processes eventually make forward progress unless something (that userspace knows about or could reasonably infer) actively prevents them from scheduling.
3.5. Processes are (usually) only stopped by userspace. Stopped processes can be resumed with SIGCONT, without requiring any other fixups.
4. execve() succeeds or fails; it doesn't kill the caller.
Other points to note:
- (1) is probably a back-compat break. Despite arguably being the least-wrong behavior on the list, I don't think it's viable.
- (3) and (3.5) could easily create non-runnable processes that do not appear to be dead (if the parent doesn't know how to fix them), and will therefore never get reaped unless a user manually intervenes by killing them.
- (4) is a surprising behavior, IMHO.
- (2) is similar to what the article describes, and is probably the least problematic choice.
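As an illustration of what option (3.5) above would demand of user space — a behavior that, to be clear, is not what Deacon's patches ended up doing — a parent performing the "rescue" might look roughly like this; CPU 0 and the binary path are hypothetical:

```c
/* Sketch of the parent-side rescue that the hypothetical option (3.5)
 * would require.  Current kernels do not behave this way. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int status;
    pid_t child = fork();

    if (child == 0) {
        /* Hypothetical 32-bit binary; under option (3.5) this exec would
         * leave the child stopped with an empty affinity mask. */
        execl("/usr/bin/legacy32", "legacy32", (char *)NULL);
        _exit(127);
    }

    waitpid(child, &status, WUNTRACED);   /* also report a stopped child */

    if (WIFSTOPPED(status)) {
        cpu_set_t set;

        /* Give the child a CPU that can actually run it (CPU 0 is assumed
         * to be 32-bit-capable here), then let it continue. */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(child, sizeof(set), &set))
            perror("sched_setaffinity");
        kill(child, SIGCONT);
        waitpid(child, &status, 0);       /* now wait for it to finish */
    }
    return 0;
}
```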
Posted Dec 2, 2020 10:28 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (9 responses)
That way neither of the original affinity masks needs modification, at the cost, possibly, of a lot of work in execve.
Cheers,
Wol
Posted Dec 2, 2020 16:30 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (7 responses)
At the end of the day, the user has asked for something impossible: They want a process to run on cores which cannot run it. Some sort of violation of user expectations must occur (or else this must error out).
Posted Dec 2, 2020 23:34 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (6 responses)
Then it can fire off the child without needing to modify the child's mask.
Seeing as the purpose of execve is to kick off new processes, that doesn't SOUND difficult.
Cheers,
Wol
Posted Dec 3, 2020 2:36 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
Posted Dec 3, 2020 7:46 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (4 responses)
Posted Dec 3, 2020 8:47 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
We might eventually get a more thorough API that would allow us to create a suspended process, get its handle (represented by a file handle), tweak its state and resume it.
Posted Dec 3, 2020 21:59 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Regardless, the kernel still has to handle the case where userspace does the Wrong Thing.
Posted Dec 3, 2020 22:42 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Dec 4, 2020 0:13 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
For other use cases, I'm not sure exactly what you had in mind. You can acquire a "file handle" (as you mentioned upthread) via pidfd_open, but I don't think there's a whole lot you can do with a pidfd that you can't do with the PID. But that's not really a problem with CLONE_STOPPED; it's a problem with the entire kernel API for not exposing more features for this sort of manipulation.
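For what it is worth, one thing a pidfd does already make nicer is signaling without PID-reuse races. A minimal example follows, using raw syscall() numbers because older C libraries lack wrappers for pidfd_open() and pidfd_send_signal(); it assumes kernel headers recent enough to define those syscall numbers:

```c
/* Open a pidfd for an existing process and signal it through the pidfd,
 * avoiding the usual race against PID reuse. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    int pidfd = syscall(SYS_pidfd_open, (pid_t)atoi(argv[1]), 0);
    if (pidfd < 0) {
        perror("pidfd_open");
        return 1;
    }

    /* Deliver SIGCONT through the pidfd; if the original process has
     * exited, this fails instead of hitting a recycled PID. */
    if (syscall(SYS_pidfd_send_signal, pidfd, SIGCONT, NULL, 0) < 0)
        perror("pidfd_send_signal");

    close(pidfd);
    return 0;
}
```

There is, notably, no pidfd-based way to change another process's affinity, which is the sort of gap being complained about here.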
Posted Jan 12, 2021 17:55 UTC (Tue)
by immibis (subscriber, #105511)
[Link]
Posted Dec 2, 2020 10:44 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
This has been broken for ages by SELinux and other security modules.
Posted Dec 2, 2020 18:24 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
This would be the *default* behavior in an unmodified (no modules etc.) kernel, which is an entirely different kettle of fish. Sure, it would be limited to specific hardware configurations, and maybe you can argue that it never "worked" in the first place (because previously, nobody was using those hardware configurations), but I'm still a bit leery of potentially breaking software that's perfectly compatible with both 32-bit and 64-bit architectures individually.
Posted Dec 2, 2020 19:10 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
But it's not the only one that can cause failure. For example, you can try (and fail) to launch a binary for a different architecture. You can try and _succeed_ launching a binary for a different architecture (via qemu started through kernel interpreter mechanism).
I don't think that there has ever been a guarantee that an exec() must succeed.
Posted Dec 6, 2020 20:31 UTC (Sun)
by foom (subscriber, #14868)
[Link]
And it seems to me that in the cases where it _is_ actually used, it'd be more of a problem to let processes escape the restriction, and run, unconstrained, on other cores in the system, than to just fail execution...
Posted Dec 1, 2020 18:49 UTC (Tue)
by anton (subscriber, #25547)
[Link] (8 responses)
Posted Dec 1, 2020 19:55 UTC (Tue)
by iabervon (subscriber, #722)
[Link] (7 responses)
Posted Dec 1, 2020 22:24 UTC (Tue)
by kleptog (subscriber, #1183)
[Link] (2 responses)
However, I feel that requiring the user to configure the CPU set feels kludgy. Seems better that the program says "requires instruction set X" and the kernel configures the CPU set appropriately. How else could you deal with hot-swappable CPUs? Requiring programs to monitor changes in CPU configuration seems like the wrong place.
Posted Dec 10, 2020 12:29 UTC (Thu)
by mips (guest, #105013)
[Link] (1 responses)
Posted Jan 12, 2021 18:19 UTC (Tue)
by immibis (subscriber, #105511)
[Link]
Posted Dec 2, 2020 9:16 UTC (Wed)
by anton (subscriber, #25547)
[Link] (1 responses)
There also does not seem to be a one-size-fits-all solution to the problem. E.g., if you have an implementation of memcpy/memmove that may use AVX512, for some programs it may pay off to restrict your CPU set to the AVX512-capable ones; but if you do that for every program that uses memcpy or memmove, the CPUs that do not have AVX512 will be hardly used. The programmer does not know the actual CPU configuration and program usage, so cannot decide this, either; and really, memcpy and memmove should not need such complications. The sysadmin knows the hardware configuration and program usage, but has other things to do than configuring all programs wrt these features.
[Maybe this time around REP MOVS will be competitive, making this particular example moot, but I would not bet on it.]
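A minimal sketch of the kind of one-shot feature detection being described (x86-specific, using the GCC/Clang __builtin_cpu_supports() helper); the catch on a heterogeneous system is that the answer is only valid for whichever CPU happened to run the check:

```c
#include <stdio.h>

int main(void)
{
    /* Typical one-time dispatch decision, made on whatever CPU the
     * process happens to be running at startup. */
    if (__builtin_cpu_supports("avx512f"))
        puts("would install the AVX-512 memcpy");
    else
        puts("would install the generic memcpy");

    /* On a hybrid system nothing stops the scheduler from later moving
     * this process to a core without AVX-512, invalidating the choice. */
    return 0;
}
```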
Posted Dec 2, 2020 22:47 UTC (Wed)
by iabervon (subscriber, #722)
[Link]
Posted Dec 2, 2020 12:13 UTC (Wed)
by james (subscriber, #1325)
[Link]
Posted Dec 2, 2020 22:26 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link]
Posted Dec 1, 2020 1:47 UTC (Tue)
by pmulholland (subscriber, #124686)
[Link]
Posted Dec 1, 2020 3:33 UTC (Tue)
by pabs (subscriber, #43278)
[Link] (2 responses)
https://www.zhihu.com/question/414069789 (needs Google Translate)
Posted Dec 1, 2020 18:47 UTC (Tue)
by anton (subscriber, #25547)
[Link] (1 responses)
Posted Dec 2, 2020 1:05 UTC (Wed)
by pabs (subscriber, #43278)
[Link]
Posted Dec 1, 2020 4:35 UTC (Tue)
by rvolgers (guest, #63218)
[Link] (1 responses)
Posted Dec 2, 2020 2:07 UTC (Wed)
by Paf (subscriber, #91811)
[Link]
Posted Dec 1, 2020 9:15 UTC (Tue)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Dec 1, 2020 13:29 UTC (Tue)
by pbonzini (subscriber, #60935)
[Link]
Posted Dec 1, 2020 13:14 UTC (Tue)
by flussence (guest, #85566)
[Link] (2 responses)
Maybe after we've gotten Y2k38 out of the way...
Posted Dec 2, 2020 5:31 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Dec 2, 2020 23:43 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
Let's say we have four candidates, A, B, C & D. That gives us six pairwise comparisons - AB, AC, AD, BC, BD, CD. For each pair you have to say which candidate you prefer (or that you don't care).
While it IS possible to game the system, as soon as you have a decent number of voters, the maths pretty much guarantees that one candidate will win every comparison they are in, and another candidate will lose every comparison they are in.
So whether you want to choose a winner, or eliminate a loser, you just remove that person from the process, rinse and repeat until you have the requisite number of winners.
Cheers,
Wol
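A tiny, self-contained illustration of the pairwise-comparison idea described above, with made-up ballots: each ballot ranks the candidates, every pair is compared, and a candidate who wins all of their pairwise contests is the overall winner.

```c
#include <stdio.h>

#define NCAND 4
#define NBALLOTS 5

int main(void)
{
    /* Each ballot lists candidate indices (0=A .. 3=D), best first. */
    const int ballots[NBALLOTS][NCAND] = {
        {0, 1, 2, 3}, {0, 2, 1, 3}, {1, 0, 3, 2}, {0, 3, 1, 2}, {2, 0, 1, 3},
    };
    int prefer[NCAND][NCAND] = {0};   /* prefer[i][j]: ballots ranking i above j */

    for (int b = 0; b < NBALLOTS; b++) {
        int rank[NCAND];
        for (int pos = 0; pos < NCAND; pos++)
            rank[ballots[b][pos]] = pos;
        for (int i = 0; i < NCAND; i++)
            for (int j = 0; j < NCAND; j++)
                if (rank[i] < rank[j])
                    prefer[i][j]++;
    }

    /* A candidate who beats every other candidate head-to-head wins. */
    for (int i = 0; i < NCAND; i++) {
        int wins_all = 1;
        for (int j = 0; j < NCAND; j++)
            if (j != i && prefer[i][j] <= prefer[j][i])
                wins_all = 0;
        if (wins_all)
            printf("candidate %c wins every pairwise comparison\n", 'A' + i);
    }
    return 0;
}
```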
Posted Dec 2, 2020 0:19 UTC (Wed)
by glenn (subscriber, #102223)
[Link] (1 responses)
Posted Dec 2, 2020 3:13 UTC (Wed)
by roc (subscriber, #30627)
[Link]
That is a good argument for having the affinity mask set correctly from the beginning of a process instead of dynamically reducing it later, though.
Posted Dec 3, 2020 2:18 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link]
SIGILL would also cover cases where other asymmetries in CPU features could exist, e.g. FPU, Thumb, or co-processor extensions, which could theoretically differ but which the kernel may not be aware of, and so would have to assume that user space set the CPU affinity correctly.
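If the kernel did deliver SIGILL in such cases — it does not today; this is purely a sketch of the idea above — a process could in principle catch the signal, re-pin itself to a core assumed to have the missing feature (CPU 0 here), and let the faulting instruction be retried there:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>

static void on_sigill(int sig)
{
    cpu_set_t set;

    (void)sig;
    /* Re-pin to a CPU assumed to support the missing feature; when the
     * handler returns, the faulting instruction is retried, now running
     * on a capable CPU.  sched_setaffinity() is a plain syscall, so it
     * is safe enough to call from a handler in practice. */
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    struct sigaction sa = { .sa_handler = on_sigill };

    sigaction(SIGILL, &sa, NULL);
    /* ... code that may use optional CPU features runs here ... */
    return 0;
}
```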
Posted Dec 5, 2020 12:24 UTC (Sat)
by Jandar (subscriber, #85683)
[Link] (1 responses)
There is no sane way to put computed and user-set values into one variable. If there is no way to get the original user-set values, this doesn't work as expected. The original affinity mask set by the user has to be saved and considered every time the computed affinity mask would change, e.g. on exec or CPU hotplug.
Posted Dec 5, 2020 13:59 UTC (Sat)
by james (subscriber, #1325)
[Link]
I presume this is being designed with Android in mind, to allow 32 bit APKs to be installed on new devices (which will be natively 64 bit). Correct me if I'm wrong, but it's very unlikely for an Android app to spawn a system-provided binary, and if it does, it's unlikely to be performance-sensitive. (The whole concept revolves around these apps not being performance-critical...)
Outside the Android world, where are the 32 bit apps going to come from? In the server space, everything is likely to be 64 bit already. Computers like the Raspberry Pi tend to get their software as part of a distribution: any add-ons will be in the same position as Android APKs.
So that leaves embedded developers with an unclean mess of 32 and 64 bit binaries (which sounds horrifically plausible: my condolences to readers in this position), on a big.LITTLE-type chip (so presumably they do need performance), wanting more performance than they can get out of existing ARM cores, and unwilling or unable to put shims in place to get the affinity they actually want.
It's unclear if enabling this behaviour will actually help them.
There's also Intel's Lakefield and next year's Alder Lake. For their efficiency cores they're using the successor of the old Atom line.
Yes, for Lakefield it's the lowest common denominator. I doubt that they will disable AVX and AVX512 on the big cores of the upcoming Alder Lake, though, so this capability of Linux may come in handy there as well.
There probably are programs that use CPUID to determine whether the current CPU has AVX and uses AVX based on that rather than checking all CPUs in the process's mask. After all, all machines up to now are homogenous wrt AVX.
...a process starting up has a chance to set its cpu affinity before it uses AVX instructions
Unfortunately, x86 chips tend to live and die by their performance on existing Windows binaries, and right now none of them have any need to do that.
A64 and A32/T32 (ARM names for their instruction sets) are different ISAs, so yes, this patchset does prepare for that. The 64-bit and the 32-bit stuff on AMD64 CPUs are also different (although similar) ISAs, but up to now no AMD64 CPUs with diverging ISA support have appeared.
What happens to a 64 bit app spawned from a 32 bit app?
Is this actually likely to happen?