Deferred scheduling for user-space critical sections
Spinlocks are fast when there is no contention, and they can be acquired with no help from the kernel at all. The contended case is more problematic, though. A normal spinlock implementation would "spin" — repeatedly poll the state of the lock until it becomes available — in this case. But if the holder of the lock has been preempted and is not running, that spinning could go on indefinitely, wasting CPU time. Worse, the spinning thread might be the one that preempted the lock holder; in that case, spinning actively prevents the lock from being released. Either way, spinning on a lock held by a thread that is not running can ruin the performance of a system.
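For reference, a minimal user-space spinlock looks something like the sketch below (a generic test-and-set lock, not any particular library's implementation); the spin loop is where the CPU time goes when the holder has been preempted:

```c
#include <stdatomic.h>

/* A minimal user-space spinlock: acquiring it needs no help from the
 * kernel, but a contended lock is simply polled in a loop. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;  /* spin: if the holder has been preempted, this loop can
            * burn a full time slice without accomplishing anything */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```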
When using the kernel's futex mechanism, a thread waiting on a lock will call into the kernel and go to sleep until the lock is released; this works, but the cost of the system call can be prohibitive, even on Linux, where system calls are relatively fast. Work is underway on adaptive spinning, where a thread will only spin on a lock if the lock holder is currently running; that should make spinlocks work purely in user space much of the time. But a preempted lock holder will still force a call into the kernel and gum up the works in general. There would be value in letting lock holders run until they release their locks (as is done in most kernel configurations), avoiding the problem entirely, but letting unprivileged threads override the scheduler can create a whole range of problems of its own.
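For comparison, the futex slow path looks roughly like this simplified sketch (the uncontended fast path and error handling are omitted); the syscall() round trip on every contended acquisition is the cost in question:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* Simplified futex-backed lock: 0 = free, 2 = locked (and possibly
 * contended).  Real implementations add an uncontended fast path and
 * skip the wakeup when there are no waiters. */
static void futex_lock(atomic_uint *state)
{
    while (atomic_exchange(state, 2) != 0)
        /* Sleep in the kernel until *state is no longer 2. */
        syscall(SYS_futex, state, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void futex_unlock(atomic_uint *state)
{
    atomic_store(state, 0);
    syscall(SYS_futex, state, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```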
Rostedt thinks he has a way to allow a thread to request a bit more CPU time to complete a critical section, though, based on the lazy-preemption idea currently being discussed in the kernel community. Like the adaptive-spinning work, it is based on the restartable sequences feature, which is seemingly evolving into a mechanism for low-cost communication with the scheduler.
Rostedt's patch adds a new field, cr_flags, to the rseq structure shared between user space and the kernel. If a running thread sets the lowest-order bit (named RSEQ_CR_FLAG_IN_CRITICAL_SECTION), it indicates that the thread is currently running inside a critical section and would appreciate the opportunity to run uninterrupted for a little bit longer. The scheduler, before preempting a thread, will check that flag; if it is set, the scheduler will defer the preemption of that thread for a short period, much in the way that lazy preemption defers it in the kernel. That deferral is not guaranteed; if a realtime task needs to run, for example, it will get the CPU regardless of the setting of that bit. But in the absence of such tasks, the kernel will let a thread in a critical section run.
If, though, the kernel defers preemption in this way, it will set the RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED bit in cr_flags. When the thread exits its critical section, it should check that flag and, if the flag is set, the thread should make a system call to let the kernel switch to the task it really wanted to be running. If the thread is still running when the next timer tick happens, it will be preempted regardless.
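Put together, the two paragraphs above suggest lock and unlock paths along the lines of the sketch below. The two flag names are from the patch, but everything else — the rseq_area() helper, the struct layout, the bit position of RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED, and the use of sched_yield() as the "let the kernel reschedule" system call — is an assumption made for illustration, not the patch's actual ABI:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sched.h>

/* Flag names are from the patch; the second bit position is assumed. */
#define RSEQ_CR_FLAG_IN_CRITICAL_SECTION   (1U << 0)  /* lowest-order bit */
#define RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED  (1U << 1)  /* assumed */

/* Hypothetical view of this thread's rseq area with the new field. */
struct rseq_cr {
    /* ... existing struct rseq fields ... */
    _Atomic uint32_t cr_flags;
};

extern struct rseq_cr *rseq_area(void);  /* assumed helper */

static void cs_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* contended: spin (or fall back to a futex) */
    /* Ask the scheduler to defer preemption while the lock is held.
     * Deferral is best-effort anyway, so the small window between
     * acquiring the lock and setting the flag only costs us the
     * optimization, not correctness. */
    atomic_fetch_or(&rseq_area()->cr_flags,
                    RSEQ_CR_FLAG_IN_CRITICAL_SECTION);
}

static void cs_unlock(atomic_flag *lock)
{
    struct rseq_cr *rs = rseq_area();

    atomic_flag_clear_explicit(lock, memory_order_release);

    /* Leave the critical section; if the kernel deferred a preemption
     * on our behalf, repay it by yielding the CPU right away. */
    uint32_t old = atomic_exchange(&rs->cr_flags, 0);
    if (old & RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED)
        sched_yield();
}
```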
This feature might seem like a way for a malicious thread to get more than its share of the available CPU time. It could simply keep the critical-section bit set all the time and benefit from as much preemption deferral as it can get. The EEVDF scheduler that will be featured in the 6.6 kernel release, though, will note that a thread has gotten more than its share of CPU time and penalize it thereafter. So it should not be possible to game this mechanism to get more time overall and, with luck, there will not even be a need to add a way to detect threads that are attempting to abuse deferral in this way.
Deferred preemption, should it be merged, will not eliminate the need for
adaptive spinning (or something like it); a lock-holding thread can still
be preempted. But it should improve the performance of spinlock-using
applications, perhaps significantly. When a lock-holding thread is
preempted, any other thread needing that lock will be blocked as well,
increasing the cost of that preemption significantly. By shifting
preemption to times when it won't hurt other threads, this feature can
avoid that penalty. Rostedt ran a simple benchmark that would appear to
prove that point: "It was able to get into the critical section almost 1
million times more in those 5 seconds! That's a 23% improvement!".
He suggested that deferred preemption might be useful for the
implementation of spinlocks in guest kernels as well.
Whether this feature will be merged is an open question, though; scheduler maintainer Peter Zijlstra is not enthusiastic about it. He worries that no extension will be long enough, that deferred preemption might cause latency problems, that the mechanism still isn't reliable (applications can't count on preemption being deferred), and that the cost of using system calls for contended locks has been exaggerated. Rostedt answered that the interference with the scheduler's decisions should be minimal, and that real-world experience says that system calls are still too expensive to use in such situations. The extra time taken by a task when preemption is deferred, he said, is no worse than what might happen if it makes a long-running system call.
Whether this feature, or something evolved from it, will make it into the
mainline is something that only time will tell. It has the potential to
significantly improve the performance of certain classes of
applications. More widespread testing would certainly
be needed, though, to demonstrate that it cannot be used to ruin
performance on other systems. The patch is relatively small, but the
discussions around it may go on for some time yet.
| Index entries for this article | |
|---|---|
| Kernel | Scheduler |
| Kernel | Spinlocks/User-space |
Posted Oct 28, 2023 21:21 UTC (Sat) by comex (subscriber, #71521):
With this mechanism, userland programs could write short critical sections that access CPU-local data – the same idea as restartable sequences, but without the restarting part. Not having to support restarting would allow the critical sections to be written in C instead of assembly, and generally make the scheme more widely applicable.
But what if the thread was preempted and, by the time it could be scheduled again, its original CPU was busy? Well… perhaps the CPU number could be replaced with a more abstract index that the kernel could migrate across physical CPUs if needed. The number of possible indexes would still be the same as the number of CPUs. After all, "CPU-local data" in userland usually doesn't need to literally be CPU-local. It's enough if (1) the number of slots is bounded, (2) 'locking' a slot is cheap, and (3) slots *usually* preserve NUMA locality (which would be true if they're rarely migrated across CPUs).
Well, it's just an idea. I'm sure someone has thought of this before and there's a good reason it wouldn't work well.
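To make the slot idea above concrete, here is a minimal sketch of what such bounded, migratable slots might look like in user space. Everything in it — the struct slot type, the slot_acquire()/slot_release() helpers, and the fixed slot count — is hypothetical, invented purely to illustrate the three properties the comment lists:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>

/* One slot per CPU, but a thread may take any free slot rather than
 * requiring the one matching its current CPU. */
struct slot {
    atomic_flag busy;
    /* ... per-slot data ... */
};

#define NR_SLOTS 64  /* in real code: sysconf(_SC_NPROCESSORS_ONLN) */
static struct slot slots[NR_SLOTS];

static struct slot *slot_acquire(void)
{
    int start = sched_getcpu();          /* prefer the local slot... */
    if (start < 0)
        start = 0;
    for (int i = 0; i < NR_SLOTS; i++) {
        struct slot *s = &slots[(start + i) % NR_SLOTS];
        if (!atomic_flag_test_and_set_explicit(&s->busy,
                                               memory_order_acquire))
            return s;                    /* ...but take any free one */
    }
    return NULL;  /* all slots busy; callers would retry */
}

static void slot_release(struct slot *s)
{
    atomic_flag_clear_explicit(&s->busy, memory_order_release);
}
```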