3.0 and RCU: what went wrong
My goal has always been for my code to go in without so much as a ripple. Although I don't always meet that goal, I can't recall any recent failure quite as spectacular as RCU in v3.0. My v3.0 code didn't just cause a few ripples, it bellyflopped. It is therefore worthwhile to review what happened and why it happened in order to avoid future bellyflops and trainwrecks.
This post-mortem will cover the following topics:
- Overview of preemptible RCU read-side code
- Steaming towards the trainwreck
- Fixes
- Current status
- Preventing future bellyflops and trainwrecks
Overview of preemptible RCU read-side code
Understanding the trainwreck requires reviewing a small amount of TREE_PREEMPT_RCU's read-side code. First, let's look at __rcu_read_lock(), which, in preemptible RCU, does the real work for rcu_read_lock():
 1 void __rcu_read_lock(void)
 2 {
 3   current->rcu_read_lock_nesting++;
 4   barrier();
 5 }
This is quite straightforward: line 3 increments the per-task ->rcu_read_lock_nesting counter and line 4 ensures that the compiler does not bleed code from the following RCU read-side critical section out before the __rcu_read_lock(). In short, __rcu_read_lock() does nothing more than increment a nesting-level counter.
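As a quick illustration, consider a doubly nested reader; only the outermost rcu_read_unlock() (described next) sees the counter return to zero and therefore considers special processing. This is a sketch, not kernel code, and the function is hypothetical:

/*
 * Sketch: how ->rcu_read_lock_nesting evolves for nested readers.
 * The function and the data it would protect are hypothetical.
 */
static void sketch_nested_reader(void)
{
	rcu_read_lock();	/* ->rcu_read_lock_nesting: 0 -> 1 */
	rcu_read_lock();	/* ->rcu_read_lock_nesting: 1 -> 2 */
	/* ... access RCU-protected data ... */
	rcu_read_unlock();	/* 2 -> 1: no special-processing check */
	rcu_read_unlock();	/* 1 -> 0: check ->rcu_read_unlock_special */
}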
The __rcu_read_unlock() function, which, in preemptible RCU, does the real work for rcu_read_unlock(), is only slightly more complex:
 1 void __rcu_read_unlock(void)
 2 {
 3   struct task_struct *t = current;
 4
 5   barrier();
 6   --t->rcu_read_lock_nesting;
 7   barrier();
 8   if (t->rcu_read_lock_nesting == 0 &&
 9       unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
10     rcu_read_unlock_special(t);
11 }
Line 5 prevents the compiler from bleeding code from the RCU read-side critical section out past the __rcu_read_unlock(), and line 6 decrements the per-task nesting-level counter, so thus far __rcu_read_unlock() is simply the inverse of __rcu_read_lock(). However, if the nesting counter is now zero, we need to check whether anything unusual happened during the just-ended RCU read-side critical section, which is the job of lines 8 and 9. Line 7 prevents the compiler from moving this check to precede the decrement on line 6: otherwise, something unusual might happen just after the check but before the decrement, which would in turn mean that __rcu_read_unlock() would fail to clean up after that unusual something.
The "unusual somethings" are:
- The RCU read-side critical section might have blocked or
been preempted.
In this case, the per-task
->rcu_read_unlock_special
variable will have theRCU_READ_UNLOCK_BLOCKED
bit set. - The RCU read-side critical section might have executed for
more than a jiffy or two.
In this case, the per-task
->rcu_read_unlock_special
variable will have theRCU_READ_UNLOCK_NEED_QS
bit set.
In either case, the per-task ->rcu_read_unlock_special will be non-zero, so that __rcu_read_unlock() will invoke rcu_read_unlock_special(), which we look at next:
 1 static void rcu_read_unlock_special(struct task_struct *t)
 2 {
 3   int empty;
 4   int empty_exp;
 5   unsigned long flags;
 6   struct rcu_node *rnp;
 7   int special;
 8
 9   if (in_nmi())
10     return;
11   local_irq_save(flags);
12   special = t->rcu_read_unlock_special;
13   if (special & RCU_READ_UNLOCK_NEED_QS) {
14     rcu_preempt_qs(smp_processor_id());
15   }
16   if (in_irq()) {
17     local_irq_restore(flags);
18     return;
19   }
20   if (special & RCU_READ_UNLOCK_BLOCKED) {
21     t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
22
23     /* Clean up after blocking. */
24
25   }
26 }
Lines 9 and 10 take an early exit if we are executing in non-maskable interrupt (NMI) context. The reason for this early exit is that NMIs cannot be interrupted or preempted, so there should be no rcu_read_unlock_special() processing required. Otherwise, line 11 disables interrupts and line 12 takes a snapshot of the per-task ->rcu_read_unlock_special variable. Line 13 then checks to see whether the just-ended RCU read-side critical section ran for too long and, if so, line 14 invokes rcu_preempt_qs() to immediately record a quiescent state. Recall that any point in the code that is not in an RCU read-side critical section is potentially a quiescent state; therefore, since someone is waiting, the quiescent state is reported immediately.
Lines 16 through 18 take an early exit if we are executing in a hardware interrupt handler. This is appropriate given that hardware interrupt handlers cannot block, so it is not possible to preempt or to block within an RCU read-side critical section running within a hardware interrupt handler. (Of course, threaded interrupt handlers are another story altogether.)
Finally, line 20 checks to see if we blocked or were preempted within the just-ended RCU read-side critical section, clearing the corresponding bit and cleaning up after blockage or preemption if so. The exact details of the cleanup are not important (and are therefore omitted from the code fragment above), although curious readers are referred to kernel.org.
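For reference, here is a rough sketch of how the RCU_READ_UNLOCK_BLOCKED bit comes to be set in the first place. The bit values and the function name are illustrative only; the real work is done by RCU's context-switch hook, which also handles queuing of the blocked task.

/*
 * Sketch (not the kernel's code): roughly how the RCU_READ_UNLOCK_BLOCKED
 * bit gets set when a reader blocks or is preempted.  The bit values and
 * the function name are illustrative.
 */
#define RCU_READ_UNLOCK_BLOCKED 0x1  /* illustrative value */
#define RCU_READ_UNLOCK_NEED_QS 0x2  /* illustrative value */

static void sketch_note_context_switch(struct task_struct *t)
{
	if (t->rcu_read_lock_nesting > 0)	/* inside a reader? */
		t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BLOCKED;
}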
The important thing is what happens if this RCU read-side critical section was the last one blocking an expedited RCU grace period or if the just-ended RCU read-side critical section was priority-boosted. Either situation requires that RCU interact with the scheduler, which may require the scheduler to acquire its runqueue and priority-inheritance locks. Because the scheduler disables interrupts when acquiring the runqueue and priority-inheritance locks, an RCU read-side critical section that lies entirely within one of these locks' critical sections cannot be interrupted, preempted, or blocked. Therefore, such an RCU read-side critical section should not enter rcu_read_unlock_special(), and should thus avoid what would otherwise be an obvious self-deadlock scenario.

As we will see later, a number of self-deadlock scenarios can be avoided via the in_irq() early exit from rcu_read_unlock_special(). Keep the critical importance of this early exit firmly in mind as we steam down the tracks towards the RCU/scheduler/threaded-irq trainwreck.
Steaming towards the trainwreck
Before we leave the station, please keep in mind that in_irq() can return inaccurate results because it consults the preempt_count() bitmask, which is updated in software. At the start of an interrupt, there is therefore a period of time during which the interrupt handler has started executing but preempt_count() has not yet been updated to record the start of the interrupt, so that in_irq() returns false. Similarly, at the end of the interrupt, there is a period of time after preempt_count() has been updated to record the end of the interrupt but before the interrupt handler has completed executing, during which in_irq() again returns false. This last is most emphatically the case when the end-of-interrupt processing kicks off softirq handling.
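For reference, in_irq() boils down to a check of preempt_count() bits, roughly as follows; treat this as a sketch of the 3.0-era definitions in include/linux/hardirq.h rather than a verbatim quote.

/*
 * Sketch of the 3.0-era definitions.  The HARDIRQ bits are added by
 * irq_enter() only after the low-level entry code has already begun
 * running the handler, and are removed by irq_exit() before
 * end-of-interrupt work such as invoke_softirq() runs, which is what
 * creates the windows described above.
 */
#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
#define in_irq()	(hardirq_count())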
With that background, the sequence of commits leading to the trainwreck is as follows:
- In March of 2009, commit a18b83b7ef added the first known rcu_read_unlock() to be called while holding a runqueue lock.

Quick Quiz 2: Suppose that an RCU read-side critical section is enclosed within a runqueue-lock critical section. Why couldn't that RCU read-side critical section be the last RCU read-side critical section blocking a TREE_PREEMPT_RCU expedited grace period? Answer

Quick Quiz 3: Why can't we avoid this whole mess by treating interrupt-disabled segments of code as if they were RCU read-side critical sections? Answer
- In December of 2009, commit d9a3da069 added synchronize_rcu_expedited() to TREE_PREEMPT_RCU, which causes the last reader blocking an expedited grace period to call wake_up() from within rcu_read_unlock(). Of course, the wake_up() acquires the runqueue locks.
Although this appears to open the door to an obvious deadlock scenario where the RCU read-side critical section under the runqueue lock is the last one blocking a preemptible-RCU expedited grace period, this cannot happen as long as the runqueue lock is held across the entire duration of the RCU read-side critical section.

Quick Quiz 1: But what about RCU read-side critical sections that begin before a runqueue lock is acquired and end within that lock's critical section? Answer
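To illustrate why holding the runqueue lock across the whole reader is safe, here is a minimal sketch of the pattern; the function name and lock parameter are illustrative, not taken from the scheduler.

/*
 * Minimal sketch, not scheduler code: a reader that lies entirely within
 * a runqueue-lock critical section.  Interrupts are off for its whole
 * duration, so it can be neither preempted nor interrupted, and its
 * rcu_read_unlock() therefore never needs rcu_read_unlock_special().
 */
static void sketch_reader_under_rq_lock(raw_spinlock_t *rq_lock)
{
	unsigned long flags;

	raw_spin_lock_irqsave(rq_lock, flags);	/* interrupts now disabled */
	rcu_read_lock();
	/* ... access RCU-protected scheduler data ... */
	rcu_read_unlock();	/* ->rcu_read_unlock_special cannot have been set */
	raw_spin_unlock_irqrestore(rq_lock, flags);
}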
Continuing down the tracks toward the trainwreck:
- In June of 2010, commit f3b577dec1 added an RCU read-side critical section in wake_affine(). Given that I was blissfully unaware of the true nature of in_irq(), I raised no objection to this patch. Quite the opposite, in fact, as can be seen by a quick glance at this commit.

Quick Quiz 4: Exactly what vulnerability did commit f3b577dec1 expose? Answer

- The addition of threaded interrupt handlers meant that almost all hardware interrupts started invoking the scheduler in order to awaken the corresponding interrupt kthread, which in turn increased the likelihood that rcu_read_unlock_special() would become confused by the return value from in_irq().
- Many more RCU read-side critical sections were added within runqueue and priority-inheritance critical sections, further increasing the interaction cross-section between RCU and the scheduler.
- RCU_BOOST introduced an incorrect cross-task write to the per-task ->rcu_read_unlock_special variable. This could result in this variable being corrupted, resulting in all manner of deadlocks. This was fixed by commit 7765be2fe.
- In addition, RCU_BOOST introduced another call from RCU into the scheduler in the form of a rt_mutex_unlock().
All of these changes set the stage for the trainwreck itself, which unfolded as follows:

- An RCU read-side critical section is preempted, then resumes. This causes the per-task ->rcu_read_unlock_special variable to have the RCU_READ_UNLOCK_BLOCKED bit set.
bit set. - This task remains preempted for so long that RCU priority boosting
is invoked.
- The RCU read-side critical section ends by invoking rcu_read_unlock(), which in turn invokes the __rcu_read_unlock() function shown above.
- An interrupt arrives just after __rcu_read_unlock() reaches line 7.
- The interrupt handler runs to completion, so that irq_exit() is invoked, and irq_exit() decrements the irq nesting-level count to zero.
- irq_exit() then invokes invoke_softirq(), which determines that ksoftirqd must be awakened.
- The scheduler is invoked to awaken ksoftirqd, which acquires a runqueue lock and then enters an RCU read-side critical section.
- When this RCU read-side critical section is exited, line 9 of __rcu_read_unlock() will find that the per-task ->rcu_read_unlock_special variable is non-zero, and will therefore invoke rcu_read_unlock_special().
- Because in_irq() returns false, line 16 of rcu_read_unlock_special() does not take an early exit. Therefore, rcu_read_unlock_special() sees the RCU_READ_UNLOCK_BLOCKED bit set in ->rcu_read_unlock_special, and also notes that the task has been priority boosted. It therefore invokes the scheduler to unboost itself.
- The scheduler will therefore attempt to acquire a runqueue lock. Because this task already holds a runqueue lock, deadlock can (and sometimes did) result.
Fixes
The fixes applied to the RCU trainwreck are as follows:
- b0d30417 (rcu: Prevent RCU callbacks from executing before scheduler initialized), which does what its name says. This addressed a few boot-time hangs.
- 131906b0 (rcu: decrease rcu_report_exp_rnp coupling with scheduler), which causes RCU to drop one of its internal locks before invoking the scheduler, thereby eliminating one set of deadlock scenarios involving expedited grace periods.
- 7765be2f (rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special), which allocates a separate task_struct field to indicate that a task has been priority boosted. This change meant that the ->rcu_read_unlock_special field returned to its earlier (and correct) status of being manipulated only by the corresponding task. This prevented a number of scenarios where an instance of __rcu_read_unlock() invoked from interrupt context would incorrectly invoke rcu_read_unlock_special(), which would again result in deadlocks.
- be0e1e21 (rcu: Streamline code produced by __rcu_read_unlock()), which was an innocent bystander brought along due to dependencies among patches.
- 10f39bb1 (rcu: protect __rcu_read_unlock() against scheduler-using irq handlers), which rearranges __rcu_read_unlock()'s manipulation of ->rcu_read_lock_nesting so as to prevent interrupt-induced recursion in __rcu_read_unlock()'s invocation of rcu_read_unlock_special(), which in turn prevents another class of deadlock scenarios. This commit was inspired by an earlier patch by Steven Rostedt. (A sketch of the resulting approach appears after this list.)
- c5d753a5 (sched: Add irq_{enter,exit}() to scheduler_ipi() by Peter Zijlstra), which informs RCU that the scheduler is running. This is especially important when the IPI interrupts dyntick-idle mode: without this patch, RCU would simply ignore any RCU read-side critical sections in scheduler_ipi().
- ec433f0c (softirq,rcu: Inform RCU of irq_exit() activity by Peter Zijlstra), which informs RCU of scheduler activity that occurs from hardware interrupt level, but after irq_exit() has cleared the preempt_count() indication that in_irq() relies on. It is quite possible that 10f39bb1 makes this change unnecessary, but proving that would have delayed 3.0 even more.
- a841796f (signal: align __lock_task_sighand() irq disabling and RCU), which fixes one case where an RCU read-side critical section is preemptible, but its rcu_read_unlock() is invoked with interrupts disabled. As noted earlier, there might be a broad-spectrum solution that renders this patch unnecessary, but that solution was not appropriate for 3.0.
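To give a flavor of the 10f39bb1 change, here is a sketch of the reworked __rcu_read_unlock(); it is my paraphrase of the idea rather than the committed code. The counter is parked at a large negative value while special processing runs, so any reader entered from an interrupt handler at that point increments and decrements around INT_MIN and never sees the counter pass through zero.

/*
 * Paraphrased sketch of __rcu_read_unlock() after commit 10f39bb1, not
 * the committed code.  Interrupt-handler readers that nest inside the
 * special processing see a large negative nesting count and therefore
 * never invoke rcu_read_unlock_special() themselves.
 */
void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	barrier();
	if (t->rcu_read_lock_nesting != 1) {
		--t->rcu_read_lock_nesting;		/* not outermost: just decrement */
	} else {
		t->rcu_read_lock_nesting = INT_MIN;	/* park the counter */
		barrier();
		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
			rcu_read_unlock_special(t);
		barrier();
		t->rcu_read_lock_nesting = 0;		/* reader truly finished */
	}
}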
Current status
The Linux 3.0 version of RCU finally seems stable, but the following potential vulnerabilities remain:
- In RCU_BOOST kernels, if an RCU read-side critical section has at any time been preemptible, then it is illegal to invoke its rcu_read_unlock() with interrupts disabled (the sketch following this list shows the lock/irq ordering that avoids this situation). There is an experimental patch that removes this restriction, but at the cost of lengthening the real-time mutex acquisition code path. Work continues to find a solution with better performance characteristics.
- In all preemptible-RCU kernels, if an RCU read-side critical section has at any time been preemptible, then it is illegal to invoke its rcu_read_unlock() while holding a runqueue or a priority-inheritance lock. Although there are some possible cures for this condition, all currently known cures are worse than the disease.

Quick Quiz 5: How could you remove the restriction on possibly-preempted RCU read-side critical sections ending with runqueue or priority-inheritance locks held? Answer

- TINY_PREEMPT_RCU might well contain similar vulnerabilities.
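The following sketch contrasts the forbidden ordering with the one adopted by a841796f in __lock_task_sighand(); it is simplified and illustrative, with hypothetical function names, rather than the actual signal-handling code.

/*
 * Illustrative sketch only.  In the "bad" ordering the reader is
 * preemptible (and thus possibly priority-boosted) before interrupts
 * are disabled, so rcu_read_unlock() may need to call into the
 * scheduler with interrupts off.  In the "good" ordering interrupts
 * are disabled first, so the reader can never have been preempted.
 */
static void sketch_bad_ordering(spinlock_t *lock, unsigned long *flags)
{
	rcu_read_lock();			/* preemptible reader */
	spin_lock_irqsave(lock, *flags);	/* interrupts now off */
	rcu_read_unlock();			/* may need the scheduler: bad */
}

static void sketch_good_ordering(spinlock_t *lock, unsigned long *flags)
{
	local_irq_save(*flags);			/* interrupts off first */
	rcu_read_lock();			/* reader cannot be preempted */
	spin_lock(lock);
	rcu_read_unlock();			/* never needs special processing */
}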
Preventing future bellyflops and trainwrecks
Prevention is better than cure, so what preventative measures should be taken?
The most important preventative measure is to do a full review of the RCU code, documenting it as I go. In the past, I documented new RCU functionality as a matter of course, before that functionality was accepted into the kernel. However, over the past few years, I have gotten out of that habit. Although some of the bugs would probably have escaped me, I would likely have spotted a significant fraction. In addition, the documentation might have helped others better understand RCU, which in turn might have helped some of them to spot the bugs.
Although no one has yet reported similar bugs in TINY_PREEMPT_RCU, that does not mean that similar bugs do not exist. Therefore, when inspecting the code, I need to pay special attention to the corresponding portions of TINY_PREEMPT_RCU.
Another important preventative measure is to question long-held assumptions. My unquestioning faith in in_irq() was clearly misplaced. Although in_irq() was “good enough” for RCU for quite some time, it suddenly was not. In short, when you are working on something as low-level as RCU, you shouldn't be taking things like this for granted.
Dealing with the trainwreck also exposed some shortcomings in my test setup, which emphasizes thoroughness over fast turnaround. Although there is no substitute for a heavy round of testing on a number of different configurations, it would be good to be able to validate debug patches and experimental fixes much more quickly. I have therefore started setting up an RCU testing environment using KVM. This testing environment also has the great advantage of working even when I don't have Internet access. Additionally, use of KVM looks like it will shorten the edit-compile-debug cycle, which is quite important when chasing bugs that I actually can reproduce.
Finally, I need to update my test configurations. Some of the bugs reproduce more quickly when threaded interrupt handlers are enabled, so I need to add these to my test regime. Another bug was specific to 32-bit kernels, which I currently don't test, but which KVM makes easy to test. In fact, on my current laptop, 32-bit kernels are all that KVM is capable of testing.
Hopefully these changes will avoid future late-in-cycle RCU trainwrecks.
Acknowledgments
I am grateful to Steven Rostedt, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Ben Greear, Julie Sullivan, and Ed Tomlinson for finding bugs, creating patches, and lots of testing. I owe thanks to Jim Wasko for his support of this effort.
Answers to Quick Quizzes
Quick Quiz 1: But what about RCU read-side critical sections that begin before a runqueue lock is acquired and end within that lock's critical section?
Answer: That would be very bad. The scheduler is therefore forbidden from doing this.
Quick Quiz 2: Suppose that an RCU read-side critical section is enclosed within a runqueue-lock critical section. Why couldn't that RCU read-side critical section be the last RCU read-side critical section blocking a TREE_PREEMPT_RCU expedited grace period?

Answer: It cannot. To see why, note that the TREE_PREEMPT_RCU variant of synchronize_rcu_expedited() is implemented in two phases. The first phase invokes synchronize_sched_expedited(), which forces a context switch on each CPU. The second phase waits for any RCU read-side critical sections that were preempted in phase 1. Because acquiring runqueue locks disables interrupts, it is not possible to preempt an RCU read-side critical section that is totally enclosed in a runqueue-lock critical section, and therefore synchronize_rcu_expedited() will never wait on such an RCU read-side critical section, which in turn means that the corresponding rcu_read_unlock() cannot have a need to invoke the scheduler, thus avoiding the deadlock.

Of course, the last link in the above chain of logic was broken by a later bug, but read on...
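For concreteness, here is a sketch of the two-phase structure just described; wait_for_preempted_readers() is a hypothetical placeholder for the real phase-2 machinery, not an actual kernel function.

/*
 * Sketch of TREE_PREEMPT_RCU's synchronize_rcu_expedited() as described
 * above; not the kernel's implementation.
 */
void sketch_synchronize_rcu_expedited(void)
{
	/* Phase 1: force a context switch on every CPU. */
	synchronize_sched_expedited();

	/*
	 * Phase 2: wait only for readers that were preempted during
	 * phase 1.  A reader enclosed in a runqueue-lock critical
	 * section runs with interrupts disabled, cannot be preempted,
	 * and so never shows up here.
	 */
	wait_for_preempted_readers();	/* hypothetical placeholder */
}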
Quick Quiz 3: Why can't we avoid this whole mess by treating interrupt-disabled segments of code as if they were RCU read-side critical sections?
Answer: For two reasons:
- The fact that interrupt-disable sections of code act as RCU read-side critical sections is a property of the current implementation. Later implementations are likely to need to do quiescent-state processing off-CPU in order to reduce OS jitter, and such implementations will not be able to treat interrupt-disable sections of code as RCU read-side critical sections. This property is important to a number of users, so much so that there is an out-of-tree RCU implementation that provides it (see here and here for more recent versions). Therefore, we should be prepared for the mainline Linux kernel's RCU implementation to treat interrupt-disable sections of code as the quiescent states that they really are.
- Having multiple very different things that provide read-side protection makes the code more difficult to maintain, with RCU-sched being a case in point.
Quick Quiz 4: Exactly what vulnerability did commit f3b577dec1 expose?

Answer: Suppose that an RCU read-side critical section is the last one blocking an expedited grace period, and that its __rcu_read_unlock() is interrupted just after it decrements the nesting count to zero. The current->rcu_read_unlock_special bitmask will therefore be non-zero, indicating that special processing is required (in this case, waking up the task that kicked off the expedited grace period). Suppose further that softirq processing is kicked off at the end of the interrupt, and that there are so many softirqs pending that they need to be handed off to ksoftirqd. Therefore wake_up() is invoked, which acquires the needed runqueue locks. But because wake_affine() is invoked, there is an RCU read-side critical section whose __rcu_read_unlock() will see that current->rcu_read_unlock_special is nonzero. At this point, in_irq() will be returning false, so the resulting call to rcu_read_unlock_special() won't know to take the early exit. It will therefore invoke wake_up(), which will again attempt to acquire the runqueue lock, resulting in deadlock.
Quick Quiz 5: How could you remove the restriction on possibly-preempted RCU read-side critical sections ending with runqueue or priority-inheritance locks held?
Answer: Here are some possibilities:
- Enclose all runqueue and priority-inheritance critical sections in an RCU read-side critical section. This would mean that any rcu_read_unlock() that executed with one of these locks held would be inside the enclosing RCU read-side critical section, and thus would be guaranteed not to invoke rcu_read_unlock_special(). However, this approach would add overhead to the scheduler's fastpaths and would require yet another odd hand-crafted handoff at context-switch time.
- Keep some per-task state indicating that at least one scheduler lock is held. Then rcu_read_unlock_special() could set another per-task variable indicating that cleanup is required. The scheduler could check this flag when releasing its locks. I hope that the maintainability challenges of this approach are self-evident.
- Your idea here.