Expanding the kernel stack
For most of the history of Linux, on most architectures, the kernel stack has been put into an 8KB allocation — two physical pages. As recently as 2008 some developers were trying to shrink the stack to 4KB, but that effort eventually proved to be unrealistic. Modern kernels can end up creating surprisingly deep call chains that just do not fit into a 4KB stack.
Increasingly, it seems, those call chains don't even fit into an 8KB stack on x86-64 systems. Recently, Minchan Kim tracked down a crash that turned out to be a stack overflow; he responded by proposing that it was time to double the stack size on x86-64 to 16KB. Such proposals have seen resistance before, and that happened this time around as well; Alan Cox argued that the solution is to be found elsewhere. But he seems to be nearly alone in that point of view.
Dave Chinner frequently has to deal with stack overflow problems, since they tend to occur with the XFS filesystem, which happens to be a bit more stack-hungry than others. He was quite supportive of this change:
Linus was unconvinced at the outset, and he made it clear that work on reducing the kernel's stack footprint needs to continue. But Linus, too, seems to have come around to the idea that playing "whack-a-stack" is not going to be enough to solve the problem in a reliable way:
Linus has also, unsurprisingly, made it clear that he is not interested in
changing the stack size in the 3.15 kernel. But the 3.16 merge window can
be expected to open in the near future; at that point, we may well see this
patch go in as one of the first changes.
Index entries for this article: Kernel / Kernel stack
Posted May 30, 2014 10:51 UTC (Fri)
by HIGHGuY (subscriber, #62277)
[Link] (10 responses)
Posted May 30, 2014 13:36 UTC (Fri)
by richard_weinberger (subscriber, #38938)
[Link] (9 responses)
Posted May 30, 2014 17:11 UTC (Fri)
by luto (subscriber, #39314)
[Link] (8 responses)
Posted May 30, 2014 18:41 UTC (Fri)
by PaXTeam (guest, #24616)
[Link] (7 responses)
Posted May 30, 2014 19:55 UTC (Fri)
by luto (subscriber, #39314)
[Link] (3 responses)
Posted May 30, 2014 20:49 UTC (Fri)
by PaXTeam (guest, #24616)
[Link] (2 responses)
another advantage is that vmalloc by its nature handles lowmem fragmentation much better which becomes even more important now that amd64 kstacks have become order-2 allocations. it'd also be easy to implement lazy page allocation for kstacks further reducing their memory consumption (let's face it, many kstacks will never actually make use of the whole 16k yet they'll always have to be fully allocated in the current scheme).
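To make the "order-2" remark concrete, here is a minimal user-space sketch of the arithmetic, assuming 4KB base pages as on x86-64. The names `SKETCH_PAGE_SIZE`, `sketch_order_for_size()` and `sketch_pages_for_order()` are made up for illustration; the kernel has its own `get_order()` helper, and this is not a copy of it.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for illustration; x86-64 uses 4KB base pages. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Return the smallest order such that (SKETCH_PAGE_SIZE << order) >= size.
 * This is a user-space stand-in for the idea behind the kernel's
 * get_order(), not the kernel's implementation.
 */
unsigned int sketch_order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    assert(size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Number of physically contiguous pages an allocation of this order needs. */
size_t sketch_pages_for_order(unsigned int order)
{
    return (size_t)1 << order;
}
```

With these assumptions an 8KB stack is an order-1 allocation (two contiguous pages) and a 16KB stack is order-2 (four contiguous pages), which is why fragmentation matters more after the change.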
Posted May 30, 2014 20:52 UTC (Fri)
by luto (subscriber, #39314)
[Link]
Posted Jun 27, 2016 0:50 UTC (Mon)
by luto (subscriber, #39314)
[Link]
1. What do you do if lazy allocation fails?
2. Hitting a not-present page on the stack is likely to result in a double-fault. Intel's manual advises against trying to recover from a double-fault, and I'd like to know why before messing with it. Even if recovery were guaranteed to work, it could be interesting trying to allocate memory (which can block) in a double-fault handler.
The espfix64 code can double-fault and recover, but we ran that specific abuse of the CPU by some Intel and AMD engineers before doing it.
Posted Jun 1, 2014 8:35 UTC (Sun)
by richard_weinberger (subscriber, #38938)
[Link] (2 responses)
Posted Jun 1, 2014 12:38 UTC (Sun)
by PaXTeam (guest, #24616)
[Link] (1 responses)
Posted Jun 2, 2014 11:49 UTC (Mon)
by dgm (subscriber, #49227)
[Link]
Posted May 30, 2014 17:38 UTC (Fri)
by iabervon (subscriber, #722)
[Link]
Posted May 30, 2014 19:45 UTC (Fri)
by corbet (editor, #1)
[Link] (2 responses)
Posted May 30, 2014 19:56 UTC (Fri)
by smitty_one_each (subscriber, #28989)
[Link] (1 responses)
Posted May 30, 2014 20:05 UTC (Fri)
by boog (subscriber, #30882)
[Link]
Posted May 30, 2014 21:35 UTC (Fri)
by parcs (guest, #71985)
[Link] (10 responses)
Posted May 30, 2014 21:49 UTC (Fri)
by nevets (subscriber, #11875)
[Link] (9 responses)
A dynamic stack would be incredibly complex to implement. What happens when you need more stack? You would need to make sure the task faults when it overflows, and then the fault handler would require a separate stack. Where do you allocate the next page from? Oh, and as the stack must be contiguous, the stack must be mapped into virtual memory. Currently, all kernel memory (except things like modules and stuff allocated with vmalloc) is mapped in huge page tables, and the kernel stack is just a pointer within that mapping.
The kernel stack doesn't have to be a fixed size, but the alternatives are much worse.
Posted May 31, 2014 22:28 UTC (Sat)
by sdalley (subscriber, #18550)
[Link] (2 responses)
Posted May 31, 2014 23:06 UTC (Sat)
by dlang (guest, #313)
[Link] (1 responses)
You have the task info, then the scheduler puts some info there, then you access memory and that does a page fault, finds that you need to interact with swap, makes a call to access the disk, which may need to go through raid, lvm, union mounts and then the filesystem needs its data.....
In the case that triggered this, you didn't have a complex storage layer, you had a compile option that ate some space and a lot of cases where gcc decided to put variable data on the stack, and did so particularly inefficiently as well.
But the task doing the work that triggered this mess wasn't doing anything special, so there's no way to size the stack per task.
Posted Jun 1, 2014 10:55 UTC (Sun)
by sdalley (subscriber, #18550)
[Link]
Indeed, one can't afford to assume low stack usage if one doesn't know ahead of time whether a task might need lots of stack, even if very infrequently.
Posted Jun 2, 2014 17:21 UTC (Mon)
by jhoblitt (subscriber, #77733)
[Link] (5 responses)
Posted Jun 4, 2014 18:57 UTC (Wed)
by dtlin (subscriber, #36537)
[Link] (4 responses)
GCC -fsplit-stack works for C/C++ code in user-space now, at least on i386/x86_64 Linux. I'm not sure how hard it would be to get it working in the kernel, but at a glance it looks non-trivial: the implementation depends on -fuse-ld=gold, which doesn't work with the kernel; the generated code uses the __private_ss slot in the %gs/%fs TCB, which would presumably have to change to access something in task_struct *current instead; and __morestack uses mmap to allocate new stack segments, which won't work (and there isn't one obvious way to safely allocate memory in kernel context).
For Go, split stacks were problematic (performance-wise) and 1.3 will switch to reallocating contiguous stacks. Since it involves moving the stack (and thus changing the addresses of everything that's on the stack), it's probably not doable for C/C++. Well, I suppose you could dynamically allocate anything that escapes (like Go does), but that seems pretty invasive…
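A small user-space sketch of why moving a stack is awkward for C, under the assumption that a "stack" is just a heap block that gets copied to a larger one when it fills up. The type `struct sketch_stack` and the functions below are hypothetical and only model the copy; they are not how Go or any kernel does it. The point is that anything recorded as an offset from the base survives the move, while any raw address taken before the move would point into freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a relocatable stack: a base pointer and a size. */
struct sketch_stack {
    unsigned char *base;
    size_t size;
};

/* Allocate the initial block; returns 0 on success, -1 on failure. */
int sketch_stack_init(struct sketch_stack *s, size_t size)
{
    assert(s != NULL && size > 0);
    s->base = malloc(size);
    s->size = s->base ? size : 0;
    return s->base ? 0 : -1;
}

/*
 * Grow by allocating a larger block, copying the old contents, and
 * freeing the old block. *moved is set to nonzero when the base address
 * changed, which is always the case here because both blocks are live
 * at the moment of comparison.
 */
int sketch_stack_grow(struct sketch_stack *s, size_t new_size, int *moved)
{
    unsigned char *bigger;

    assert(s != NULL && moved != NULL);
    if (new_size <= s->size)
        return -1;
    bigger = malloc(new_size);
    if (!bigger)
        return -1;
    memcpy(bigger, s->base, s->size);
    *moved = (bigger != s->base);
    free(s->base);          /* raw pointers into the old block now dangle */
    s->base = bigger;
    s->size = new_size;
    return 0;
}
```

A caller that remembered `s.base + 10` before growing would be left with a dangling pointer, whereas one that remembered the offset `10` can still reach the same byte through the new base. A runtime that knows where every pointer lives can fix them up; plain C code gives no such guarantee.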
Posted Jun 4, 2014 20:41 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Also for performance reasons.
Posted Jun 7, 2014 5:38 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
Posted Jun 7, 2014 10:52 UTC (Sat)
by PaXTeam (guest, #24616)
[Link] (1 responses)
Posted Jun 7, 2014 11:24 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted May 31, 2014 9:21 UTC (Sat)
by mslusarz (guest, #58587)
[Link] (3 responses)
Posted May 31, 2014 11:35 UTC (Sat)
by corbet (editor, #1)
[Link] (2 responses)
Posted Jun 1, 2014 0:51 UTC (Sun)
by dgc (subscriber, #6611)
[Link] (1 responses)
-Dave.
Posted Jun 1, 2014 10:54 UTC (Sun)
by khim (subscriber, #9252)
[Link]
Sounds like a bug to me. It looks like it should be possible to do such a switch “for cheap” (i.e., by changing a few data structures without context switches), but this will require special-casing (effectively this will mean that this special thread will be executed on its own kernel stack with “borrowed” userspace) and tricky additional manipulations. Still, it may be preferable to an endlessly growing kernel stack.
Seems like a nice middle-ground between not crashing while also not ignoring the need to fix offenders.
Would be awesome. :-)
Well, I was wrong about one thing...Linus just merged the 16K stack patch for 3.15.
Merged for 3.15
That is essentially what is done in a number of places — work is shifted to a kernel thread, which has the effect of going to a different stack (one that is known not to be almost full already). Doing it any other way involves controlling access to some sort of shared stack infrastructure; that would add a lot of unwelcome complexity, to say the least.
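As a rough user-space analogy only, the pattern of handing work to another thread so that it runs on a fresh stack might look like the sketch below. It uses POSIX threads rather than any kernel interface, and `struct sketch_work`, `sketch_run_on_fresh_stack()` and `sketch_deep_sum()` are names invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work item: a function, its argument, and a result slot. */
struct sketch_work {
    long (*fn)(long);
    long arg;
    long result;
};

/* Thread entry point: runs the work item on the new thread's own stack. */
static void *sketch_worker(void *p)
{
    struct sketch_work *w = p;

    w->result = w->fn(w->arg);
    return NULL;
}

/*
 * Run fn(arg) on a freshly created thread and wait for it to finish.
 * Returns 0 on success, or the pthread error code on failure.
 */
int sketch_run_on_fresh_stack(long (*fn)(long), long arg, long *result)
{
    struct sketch_work w = { fn, arg, 0 };
    pthread_t tid;
    int err;

    assert(fn != NULL && result != NULL);
    err = pthread_create(&tid, NULL, sketch_worker, &w);
    if (err)
        return err;
    err = pthread_join(tid, NULL);
    if (err)
        return err;
    *result = w.result;
    return 0;
}

/* Example of a stack-hungry job: recursion depth grows with n. */
long sketch_deep_sum(long n)
{
    return n <= 0 ? 0 : n + sketch_deep_sum(n - 1);
}
```

The caller blocks while the worker runs, so the deep call chain consumes the worker's stack and not the caller's. The cost is the hand-off itself: creating or waking a thread and waiting for it, which is the latency the comments above are concerned about.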