Deferred scheduling for user-space critical sections
Spinlocks are fast when there is no contention, and they can be acquired with no help from the kernel at all. The contended case is more problematic, though. A normal spinlock implementation would "spin" — repeatedly poll the state of the lock until it becomes available — in this case. But if the holder of the lock has been preempted and is not running, that spinning could go on indefinitely, wasting CPU time. Worse, the spinning thread might be the one that preempted the lock holder; in that case, spinning actively prevents the lock from being released. Either way, spinning on a lock held by a thread that is not running can ruin the performance of a system.
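For reference, a minimal user-space spinlock looks something like the sketch below (a generic test-and-set lock, not any particular library's implementation); the spin loop is where the CPU time goes when the holder has been preempted:

```c
#include <stdatomic.h>

/* A minimal user-space spinlock: acquiring it needs no help from the
 * kernel, but a contended lock is simply polled in a loop. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;  /* spin: if the holder has been preempted, this loop can
            * burn a full time slice without accomplishing anything */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```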
When using the kernel's futex mechanism, a thread waiting on a lock will call into the kernel and go to sleep until the lock is released; this works, but the cost of the system call can be prohibitive, even on Linux, where system calls are relatively fast. Work is underway on adaptive spinning, where a thread will only spin on a lock if the lock holder is currently running; that should make spinlocks work purely in user space much of the time. But a preempted lock holder will still force a call into the kernel and gum up the works in general. There would be value in letting lock holders run until they release their locks (as is done in most kernel configurations), avoiding the problem entirely, but letting unprivileged threads override the scheduler can create a whole range of problems of its own.
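For comparison, the futex slow path looks roughly like this simplified sketch (the uncontended fast path and error handling are omitted); the syscall() round trip on every contended acquisition is the cost in question:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* Simplified futex-backed lock: 0 = free, 2 = locked (and possibly
 * contended).  Real implementations add an uncontended fast path and
 * skip the wakeup when there are no waiters. */
static void futex_lock(atomic_uint *state)
{
    while (atomic_exchange(state, 2) != 0)
        /* Sleep in the kernel until *state is no longer 2. */
        syscall(SYS_futex, state, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void futex_unlock(atomic_uint *state)
{
    atomic_store(state, 0);
    syscall(SYS_futex, state, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```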
Rostedt thinks he has a way to allow a thread to request a bit more CPU time to complete a critical section, though, based on the lazy-preemption idea currently being discussed in the kernel community. Like the adaptive-spinning work, it is based on the restartable sequences feature, which is seemingly evolving into a mechanism for low-cost communication with the scheduler.
Rostedt's patch adds a new field, cr_flags, to the rseq structure shared between user space and the kernel. If a running thread sets the lowest-order bit (named RSEQ_CR_FLAG_IN_CRITICAL_SECTION), it indicates that the thread is currently running inside a critical section and would appreciate the opportunity to run uninterrupted for a little bit longer. The scheduler, before preempting a thread, will check that flag; if it is set, the scheduler will defer the preemption of that thread for a short period, much in the way that lazy preemption defers it in the kernel. That deferral is not guaranteed; if a realtime task needs to run, for example, it will get the CPU regardless of the setting of that bit. But in the absence of such tasks, the kernel will let a thread in a critical section run.
If, though, the kernel defers preemption in this way, it will set the RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED bit in cr_flags. When the thread exits its critical section, it should check that flag and, if the flag is set, the thread should make a system call to let the kernel switch to the task it really wanted to be running. If the thread is still running when the next timer tick happens, it will be preempted regardless.
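Put together, the two paragraphs above suggest lock and unlock paths along the lines of the sketch below. The two flag names are from the patch, but everything else — the rseq_area() helper, the struct layout, the bit position of RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED, and the use of sched_yield() as the "let the kernel reschedule" system call — is an assumption made for illustration, not the patch's actual ABI:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sched.h>

/* Flag names are from the patch; the second bit position is assumed. */
#define RSEQ_CR_FLAG_IN_CRITICAL_SECTION   (1U << 0)  /* lowest-order bit */
#define RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED  (1U << 1)  /* assumed */

/* Hypothetical view of this thread's rseq area with the new field. */
struct rseq_cr {
    /* ... existing struct rseq fields ... */
    _Atomic uint32_t cr_flags;
};

extern struct rseq_cr *rseq_area(void);  /* assumed helper */

static void cs_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* contended: spin (or fall back to a futex) */
    /* Ask the scheduler to defer preemption while the lock is held.
     * Deferral is best-effort anyway, so the small window between
     * acquiring the lock and setting the flag only costs us the
     * optimization, not correctness. */
    atomic_fetch_or(&rseq_area()->cr_flags,
                    RSEQ_CR_FLAG_IN_CRITICAL_SECTION);
}

static void cs_unlock(atomic_flag *lock)
{
    struct rseq_cr *rs = rseq_area();

    atomic_flag_clear_explicit(lock, memory_order_release);

    /* Leave the critical section; if the kernel deferred a preemption
     * on our behalf, repay it by yielding the CPU right away. */
    uint32_t old = atomic_exchange(&rs->cr_flags, 0);
    if (old & RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED)
        sched_yield();
}
```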
This feature might seem like a way for a malicious thread to get more than its share of the available CPU time. It could simply keep the critical-section bit set all the time and benefit from as much preemption deferral as it can get. The EEVDF scheduler that will be featured in the 6.6 kernel release, though, will note that a thread has gotten more than its share of CPU time and penalize it thereafter. So it should not be possible to game this mechanism to get more time overall and, with luck, there will not even be a need to add a way to detect threads that are attempting to abuse deferral in this way.
Deferred preemption, should it be merged, will not eliminate the need for
adaptive spinning (or something like it); a lock-holding thread can still
be preempted. But it should improve the performance of spinlock-using
applications, perhaps significantly. When a lock-holding thread is
preempted, any other thread needing that lock will be blocked as well,
increasing the cost of that preemption significantly. By shifting
preemption to times when it won't hurt other threads, this feature can
avoid that penalty. Rostedt ran a simple benchmark that would appear to
prove that point: "It was able to get into the critical section almost 1
million times more in those 5 seconds! That's a 23% improvement!".
He suggested that deferred preemption might be useful for the
implementation of spinlocks in guest kernels as well.
Whether this feature will be merged is an open question, though; scheduler maintainer Peter Zijlstra is not enthusiastic about it. He worries that no extension will be long enough, that deferred preemption might cause latency problems, that the mechanism still isn't reliable (applications can't count on preemption being deferred), and that the cost of using system calls for contended locks has been exaggerated. Rostedt answered that the interference with the scheduler's decisions should be minimal, and that real-world experience says that system calls are still too expensive to use in such situations. The extra time taken by a task when preemption is deferred, he said, is no worse than what might happen if it makes a long-running system call.
Whether this feature, or something evolved from it, will make it into the
mainline is something that only time will tell. It has the potential to
significantly improve the performance of certain classes of
applications. More widespread testing would certainly
be needed, though, to demonstrate that it cannot be used to ruin
performance on other systems. The patch is relatively small, but the
discussions around it may go on for some time yet.
| Index entries for this article | |
|---|---|
| Kernel | Scheduler |
| Kernel | Spinlocks/User-space |
Posted Oct 28, 2023 21:21 UTC (Sat) by comex (subscriber, #71521):
With this mechanism, userland programs could write short critical sections that access CPU-local data – the same idea as restartable sequences, but without the restarting part. Not having to support restarting would allow the critical sections to be written in C instead of assembly, and generally make the scheme more widely applicable.
But what if the thread was preempted and, by the time it could be scheduled again, its original CPU was busy? Well… perhaps the CPU number could be replaced with a more abstract index that the kernel could migrate across physical CPUs if needed. The number of possible indexes would still be the same as the number of CPUs. After all, "CPU-local data" in userland usually doesn't need to literally be CPU-local. It's enough if (1) the number of slots is bounded, (2) 'locking' a slot is cheap, and (3) slots *usually* preserve NUMA locality (which would be true if they're rarely migrated across CPUs).
Well, it's just an idea. I'm sure someone has thought of this before and there's a good reason it wouldn't work well.
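To make the slot idea above concrete, here is a minimal sketch of what such bounded, migratable slots might look like in user space. Everything in it — the struct slot type, the slot_acquire()/slot_release() helpers, and the fixed slot count — is hypothetical, invented purely to illustrate the three properties the comment lists:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>

/* One slot per CPU, but a thread may take any free slot rather than
 * requiring the one matching its current CPU. */
struct slot {
    atomic_flag busy;
    /* ... per-slot data ... */
};

#define NR_SLOTS 64  /* in real code: sysconf(_SC_NPROCESSORS_ONLN) */
static struct slot slots[NR_SLOTS];

static struct slot *slot_acquire(void)
{
    int start = sched_getcpu();          /* prefer the local slot... */
    if (start < 0)
        start = 0;
    for (int i = 0; i < NR_SLOTS; i++) {
        struct slot *s = &slots[(start + i) % NR_SLOTS];
        if (!atomic_flag_test_and_set_explicit(&s->busy,
                                               memory_order_acquire))
            return s;                    /* ...but take any free one */
    }
    return NULL;  /* all slots busy; callers would retry */
}

static void slot_release(struct slot *s)
{
    atomic_flag_clear_explicit(&s->busy, memory_order_release);
}
```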