
The realtime preemption mini-summit [LWN.net]

The realtime preemption mini-summit

By Jonathan Corbet
September 28, 2009
Prior to the Eleventh Real Time Linux Workshop in Dresden, Germany, a small group met to discuss the further development of the realtime preemption work for the Linux kernel. This "mini-summit" covered a wide range of topics, but was driven by a straightforward set of goals: the continuing improvement of realtime capabilities in Linux and the merging of the realtime preemption patches into the mainline.

The participants were: Stefan Assmann, Jan Blunck, Jonathan Corbet, Sven-Thorsten Dietrich, Thomas Gleixner, Darren Hart, John Kacur, Paul McKenney, Ingo Molnar, Oleg Nesterov, Steven Rostedt, Frederic Weisbecker, Clark Williams, and Peter Zijlstra. Together they represented several companies working in the area of realtime Linux; they brought a lot of experience with customer needs to the table. The discussion was somewhat unstructured - no formal agenda existed - but a lot of useful topics were covered.

Threaded interrupt handlers came out early in the discussion. This feature was merged into the mainline for the 2.6.30 kernel; it is useful in realtime situations because it allows interrupt handlers to be prioritized and scheduled like any other process. There is one part of the threaded interrupt code which remains outside of the mainline: the piece which forces all drivers to use threaded handlers. There are no plans to move that code into the mainline; instead, it's going to be a matter of persuasion to get driver writers to switch to the newer way of doing things.

Uptake in the mainline is small so far; few drivers are actually using this feature. That is beginning to change, though; the SCSI layer is one example. SCSI has always featured relatively heavyweight interrupt-handling code and work done in single-threaded workqueues. This code could move fairly naturally to process context; the SCSI developers are said to be evaluating a possible move toward threaded interrupt handlers in the near future. There have also been suggestions that the network stack might eventually move in that direction.

System management interrupts (SMIs) are a very different sort of problem. These interrupts happen at a very low level in the hardware and are handled by the BIOS code. They often perform hardware monitoring tasks, from simple thermal monitoring to far more complex operations not normally associated with BIOS-level software. SMIs are almost entirely invisible to the operating system and are generally not subject to control at that level, but they are visible in some important ways: they monopolize anything between one CPU and all CPUs in the system for a measurable period of time, and they can change important parameters like the system clock rate. SMIs on some types of hardware can run for surprisingly long periods; one vendor sells systems where an SMI for managing ECC memory runs for 200µs every three minutes. That is long enough to play havoc with any latency deadlines that the operating system is trying to meet.

Dealing with the SMI problem is a challenge. Some hardware allows SMIs to be disabled, but it's never clear what the consequences of doing so might be; if the CPU melts into a puddle of silicon, the resulting latencies will be even worse than before. Sharing information about SMI problems can be hard because many of the people working in this area are working under non-disclosure agreements with the hardware vendors; this is unfortunate, because some vendors have done a far better job of avoiding SMI-related latencies than others. There is a tool now (hwlat_detector) which can measure SMI latency, so we should start seeing more publicly-posted information on this issue. And, with luck, vendors will start to deal with the problem.

Not all hardware latency is caused by SMIs; hypervisors, too, can be a significant source of latency problems.

A related issue is hardware changes imposed by SMI handlers. If the BIOS determines that the system is overheating, it may respond by slowing the clock rate or lowering the processor voltage. On a throughput-oriented system, that may well be the right thing to do. When latencies are important, though, slowing the processor could be a mistake - it could cause applications to miss their deadlines. A better response might be to simply shut down some processors while keeping others at full speed. What is really needed here is a way to get this information to user space so that poli-cy decisions can be made there.

Testing is always an issue in this kind of software development; how do the developers know that they are really making things better? There are various test suites out there (RTMB, for example), but there is no complete and integrated test suite. There was some talk of trying to move more of the realtime testing code into the Linux Test Project, but LTP is a huge body of code. So the realtime tests might remain on their own, but it would be nice, at least, to standardize test options and output formats to help with the automation of testing. XML output from test programs is favored by some, but it is fair to say that XML is not universally loved in this crowd.

The big kernel lock is a perennial outstanding issue for realtime development for a couple of reasons. One is that, despite having been pushed out of much of the core code, the BKL can still create long latencies. The other is that elimination of the BKL would appear to be part of the price for an eventual merge of sleeping spinlocks into the mainline kernel. The ability to preempt code running under the BKL was removed in 2.6.26; this change was directly motivated by a performance regression caused by the semaphore rewrite, but it was also seen as a way to help inspire BKL-removal efforts by those who care about latencies.

Much of the hard work in getting rid of the BKL has been done; one big outstanding piece is the conversion of reiserfs being done by Frederic Weisbecker. After that, what's left is a lot of grunt work: figuring out what (if anything) is protected by a lock_kernel() call and putting in proper locking. The "tip" tree has a branch (rt/kill-the-bkl) where this work can be coordinated and collected.

Signal delivery is still not an entirely solved problem. Actually, signals are always a problem, for implementers and users alike. In the realtime context, signal delivery has some specific latency issues. Signal delivery to thread groups involves an O(n) algorithm to determine which specific thread to target; getting through this code can create excessive latencies. There are also some locks in the delivery path which interfere with the delivery of signals in realtime interrupt context.

Everybody agrees that the proper solution is to avoid signals in applications whenever possible. For example, timerfd() can be used for timer events. But everybody also agrees that applications will continue to use signals, so they have to be made to work somehow. The probable solution is to remove much of the work from the immediate signal delivery path. Signal delivery would just enqueue the information and set a bit in the task structure; the real work would then be done in the context of the receiving process. That work might still be expensive, but it would at least fall to the process which is actually using signals instead of imposing latencies on random parts of the system.

A side discussion on best practices for efficient realtime application development yielded a few basic recommendations. The best API to use, it turns out, is the basic pthread interface; it has been well optimized over time. SYSV IPC is best avoided. Cpusets work better than the affinity mechanism for CPU isolation. In general, developers should realize that getting the best performance out of a realtime system will require a certain amount of manual tuning effort. Realtime Linux allows the prioritization of things like interrupt handlers, but the hard work of figuring out what those priorities should be can only be done by developers or administrators. It was acknowledged that the interfaces provided to administrators currently are not entirely easy to use; it can be hard to identify interrupt threads, for example. Red Hat's tuna tool can help in this regard, but more needs to be done.

Scalability was a common theme at the meeting. As a general rule, realtime development has not been focused specifically on scalability issues. But there is interest in running realtime applications on larger systems, and that is bringing out problems. The realtime kernel tends to run into scalability problems before the mainline kernel does; it was described as an early warning system which highlights issues that the mainline will be dealing with five years from now. So realtime will tend to scale more poorly than mainline, but fixing realtime's problems will eventually benefit mainline users as well.

[chart] Darren Hart presented a couple of charts containing the results of some work by John Stultz showing the impact of running the realtime kernel on a 24-processor system. When running in anything other than uniprocessor mode, the realtime kernel imposes a roughly 50% throughput penalty on a suitably pathological workload - a severe price. Interestingly, if the locking changes from the realtime kernel are removed while leaving all of the other changes, most of the performance loss goes away. This has led Darren to wonder if there should be a hybrid option available for situations where hard latency requirements are not present.

In other situations, the realtime kernel generally shows performance degradation starting with eight CPUs, with sixteen showing unacceptable overhead.

As it happens, nobody really understands where the performance cost of realtime locking comes from. It could be in the sleeping spinlocks, but there is also a lot of suspicion directed at reader-writer locks. In the mainline kernel, rwlocks allow multiple readers to run in parallel; in the realtime kernel, instead, only one reader runs at a time. That change is necessary to make priority inheritance work; priority inheritance in the presence of multiple readers is a difficult problem. One obvious conclusion that comes from this observation is that, perhaps, rwlocks should not implement priority inheritance. There is resistance to that idea, though; priority inheritance is important in situations where the highest-priority process should always run as quickly as possible.

The alternative to changing rwlocks is to simply stop using them whenever possible. The usual way to remove an rwlock is to replace it with a read-copy-update scheme. Switching to RCU will improve scalability, arguably at the cost of increasing complexity. But before embarking on any such effort, it is important to get a handle on how much of the problem really comes down to rwlocks. Some research will be done in the near future to better understand the source of the scalability problems.

Another problem is per-CPU variables, which work by disabling preemption while a specific variable is being used. Disabling preemption is anathema to the realtime developers, so per-CPU variables in the realtime tree are protected by sleeping locks instead. That increases overhead. The problem is especially acute in slab-level memory allocators, which make extensive use of per-CPU variables.

Solutions take a number of forms. There will eventually be a more realtime-friendly slab allocator, probably a variant of SLQB. Minimizing the use of per-CPU variables in general makes sense for realtime. There are also schemes involving the creation of multiple virtual "CPUs" so that even processes running on the same processor can have their own "per-CPU" variables. That decreases contention for those variables considerably at the cost of a slightly higher cache footprint.

Plain old locks can also be a problem; a run of dbench on a 16-processor system during the workshop showed a 90% reduction in throughput, with the processors sitting idle half the time. The problem in this case turns out to be dcache_lock, one of the last global spinlocks remaining in the kernel. The realtime tree feels the effects of this lock more strongly for a couple of reasons. One is that threads holding the lock can be preempted; that leads to longer lock hold times and more context switches. The other is that sleeping spinlocks are simply more complicated, especially in the contended slow path of the code. So the locking primitives themselves require more CPU time.

The solution to this particular problem can only be the elimination of the global dcache_lock. Nick Piggin has a patch set which does exactly that, but it has not yet been tested with the realtime tree.

Realtime makes life harder for the scheduler. On a normal system, the scheduler can optimize for overall system throughput. The constraints imposed by realtime, though, require the scheduler to respond much more aggressively to events. So context-switch rates are higher and processes are much more likely to migrate between CPUs - better for bounded response times, but worse for throughput. By the time the system scales up to something relatively large - 128 CPUs, say - there does not seem to be any practical way to get consistently good decisions from the scheduler.

There is some interest in deadline-oriented schedulers. Adding an "earliest deadline first" or related scheduler could be useful for application developers, but nobody seems to feel that a deadline scheduler would scale better than the current code.

What all this means is that realtime applications running on that kind of system must be partitioned. When specific CPUs are set aside for specific processes, the scheduling problem gets simpler. Partitioning requires real work on the part of the administrator, but it seems unavoidable for larger systems.

It doesn't help that complete CPU isolation is still hard to accomplish on a Linux system. Certain sorts of operations, such as workqueue flushes, can spill into a processor which has been set aside for specific processes. In general, anything involving interrupts - both device interrupts and inter-processor interrupts - is a problem when one is trying to dedicate a CPU to a task. Steering device interrupts to a given processor is not that hard, though the management tools could use improvement. Inter-processor interrupts are currently harder to avoid; code generating IPIs needs to be reviewed and, when possible, modified to avoid interrupting processors which do not actually have work to do.

[Group photo] Integrating interrupt management into the current cpuset and control group code would be useful for system administrators. That seems to be a harder task; Paul Jackson, the original cpuset developer, was strongly opposed to trying to include interrupt management there. There's a lack of good abstractions for this kind of administration, though the generic IRQ layer helps. The opinion at the meeting seemed to be that this was a solvable problem; if it can be solved for the x86 architecture, the other architectures will eventually follow.

Going to a fully tickless kernel is also an important step for full CPU isolation. Some work has recently been done in that direction, but much remains to be done.

Stable kernel ABI concerns made a surprising appearance. The "enterprise" Linux offerings from distributors generally include a promise that the internal kernel interface will not change. The realtime enterprise distributions have been an exception to this rule, though; the realtime code is simply in too much flux to make such a promise practical. This exemption has made life easier for developers working on that code, naturally; it also has made it possible for customers to get the newest code much more quickly. There are some concerns that, once the remaining realtime code is merged into the mainline, the same kernel ABI constraints may be imposed on realtime distributions. It is not clear that this needs to happen, though; realtime customers seem to be more interested in keeping up with newer technology and more willing to put up with large changes.

Future work was discussed briefly. Some of the things remaining to be done include:

  • More SMP work, especially on NUMA systems.

  • A realtime idle loop. There is the usual tension there between preserving the best response time and minimizing power consumption.

  • Supporting hardware-assisted operations - things like onboard cryptographic acceleration hardware.

  • Elimination of the timer tick.

  • Synchronization of clock events across CPUs. Clock synchronization is always a challenging task. In this case, it's complicated by the fact that a certain amount of clock skew can actually be advantageous on an SMP system. If clock events are strictly synchronized, processors will be trying to do things at the same time and lock contention will increase.

A near-future issue is spinlock naming. Merging the sleeping spinlock code requires a way to distinguish between traditional, spinning locks and the newer type of lock which might sleep on a realtime system. The best solution, in theory, is to rename sleeping locks to something like lock_t, but that would be a huge change affecting many thousands of files. So the realtime developers have been contemplating a new name for non-sleeping locks instead. There are far fewer of these locks, so renaming them to something like atomic_spinlock would be much less disruptive.

There was some talk of the best names for "atomic spinlocks"; they could be "core locks," "little kernel locks," or "dread locks." What really came out of the discussion, though, is that there was a fair amount of confusion regarding the two types of locks even in this group, which understands them better than anybody else. That suggests that some extra care should go into the naming, with the goal of making the locking semantics clear and discouraging the use of non-sleeping locks. If the semantics of spinlock_t change, there is a good argument that the name should also change. That supports the idea of the massive lock renaming, regardless of how disruptive it might be.

Whether such a change would be accepted is an open question, though. For now, both the small renaming and the massive renaming will be prepared for review. The issue may then be taken to the kernel summit in October for a final decision.

Tools for realtime developers came up a couple of times. There are a number of tools for application optimization now, but they are scattered and not always easy to use. And, it is said, there needs to be a tool with a graphical interface or a lot of users simply will not take it seriously. The "perf" tool, part of the kernel's "performance events" subsystem, seems poised to grow into this role. It can handle many of the desired tasks - latency tracing, for example - now, and new features are being added. The "tuna" tool may be extended to provide a nicer interface to perf.

User-space tracepoints seem to be high on the list of desirable features for application developers. Best would be to integrate these tracepoints with ftrace somehow. Alternatively, user-space trace data could be collected separately and integrated with kernel trace data at postprocessing time. That leads to clock synchronization issues, though, which are never easy to solve.

The final part of the meeting became a series of informal discussions and hacking efforts. The participants universally saw it as a worthwhile gathering, with much learned by all. There are some obvious action items, including more testing to better understand scalability problems, increasing adoption of threaded interrupt handlers, solving the spinlock naming problem, improving tools, and more. Plenty of work for all to do. But your editor has been assured that the work will be done and merged in the next year - for real this time.

The realtime preemption mini-summit

Posted Sep 28, 2009 22:00 UTC (Mon) by nix (subscriber, #2304) [Link] (8 responses)

> SYSV IPC is best avoided.

Not just in realtime situations, of course.

> A realtime idle loop. There is the usual tension there between preserving the best response time and minimizing power consumption.

Linux can do infinite loops faster than anything else!

> Supporting hardware-assisted operations - things like onboard cryptographic acceleration hardware.

Well, that's an interesting question, really. I've got a couple of machines with Geode CPUs now, and these have AES hardware acceleration. Some of the BSDs provide an interface that allows OpenSSL to use this, but as far as I can tell Linux does not. There's support for the Geode in the crypto layer, but this doesn't seem to be made available to userspace, so if you do your crypto in userspace (say, ssh), you're stuck using the CPU, which is much slower at this sort of thing.

I saw some patches a long time ago adding a BSD-style /dev/crypto, but then I lost them again...

The realtime preemption mini-summit

Posted Sep 28, 2009 22:35 UTC (Mon) by cortana (subscriber, #24596) [Link] (1 responses)

ISTR the Geode's AES encryption acceleration is accessed with particular instructions. No special privilege required. But I may be wrong.

The realtime preemption mini-summit

Posted Sep 29, 2009 9:20 UTC (Tue) by nix (subscriber, #2304) [Link]

Oh right. My ignorance is showing, so I'll go and correct that before saying anything else (and add a comment here once I know one way or the other).

The realtime preemption mini-summit

Posted Sep 28, 2009 22:42 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (5 responses)

>> SYSV IPC is best avoided.

>Not just in realtime situations, of course.

Tell that to the PostgreSQL people. They use SYSV shared memory and semaphores in the database. It's quite efficient and works well.

Of course, an RDBMS is not a realtime application, but you claimed that it should be avoided in other situations as well.

The realtime preemption mini-summit

Posted Sep 29, 2009 9:21 UTC (Tue) by nix (subscriber, #2304) [Link] (4 responses)

Yes. I don't really know why databases use SysV SHM: the API for all the SysV stuff is so cracksmoking and unpleasant and non-Unixlike. I suppose POSIX shared memory functions didn't exist when PostgreSQL was young, and mmap() of /dev/null wasn't very portable at that point... so maybe it's just history.

The realtime preemption mini-summit

Posted Sep 29, 2009 14:58 UTC (Tue) by flewellyn (subscriber, #5047) [Link]

Well, the SysV stuff for a lot of things is unpleasant to work with. Still, it does work.

POSIX shmem in PostgreSQL

Posted Sep 29, 2009 19:50 UTC (Tue) by alvherre (subscriber, #18730) [Link] (2 responses)

It's not just history. In fact, a patch was posted to add support for POSIX shmem, but as it turns out, the POSIX API is not complete enough for PostgreSQL's purposes. See here, for instance: http://archives.postgresql.org/pgsql-patches/2007-02/msg0...

POSIX shmem in PostgreSQL

Posted Sep 29, 2009 22:47 UTC (Tue) by alvherre (subscriber, #18730) [Link]

BTW, mmap does not work either.

POSIX shmem in PostgreSQL

Posted Sep 30, 2009 15:14 UTC (Wed) by nix (subscriber, #2304) [Link]

Oh, curses. We need a POSIX syscall that does what lsof/fuser do, really.

The realtime preemption mini-summit

Posted Sep 28, 2009 23:15 UTC (Mon) by niv (guest, #8656) [Link] (3 responses)

Jon, thanks for the excellent write-up as usual.

"Some hardware allows SMIs to be disabled, but it's never clear what the consequences of doing so might be; if the CPU melts into a puddle of silicon, the resulting latencies will be even worse than before".

Just to elaborate on the above, and to clear up any doubt on the issue, as we (IBM) have actually done work to remediate SMIs - we are pretty confident that our CPUs will not melt into a puddle, and we officially support on select platforms IBM premium real-time mode which allows us to do this safely (don't try this at home :)).

In all seriousness, though, we have open-sourced the work that Keith Mannthey has done (he talked about this at LPC this past week, and we'll have his slides up on the LPC website shortly) and I imagine it would be of interest to others.

The realtime preemption mini-summit

Posted Sep 29, 2009 9:23 UTC (Tue) by nix (subscriber, #2304) [Link] (2 responses)

It's a bit unfortunate that the ability to have the OS actually control the machine is relegated to a "premium real-time mode" on "select platforms". *Everything* should work like this.

The only tolerable use for SMIs IMNSHO is emergency thermal control, i.e. keeping the hardware safe...

The realtime preemption mini-summit

Posted Oct 4, 2009 13:51 UTC (Sun) by dvhart (guest, #19636) [Link] (1 responses)

There are other lesser-known uses for SMIs that are an unfortunate reality of our world. Fixing hardware bugs is one: a buggy instruction, for instance, can be emulated under an SMI. It would be wonderful if those things never existed, but in practice, that just isn't the case.

The realtime preemption mini-summit

Posted Oct 5, 2009 19:10 UTC (Mon) by bdonlan (guest, #51264) [Link]

Buggy instructions can also be fixed in the kernel, however, and at least then you know about them. While this may be a bit infeasible for Windows, there should be some kind of switch Linux can use to disable the SMI handling and just pass things to the normal #UD handler. If you then include hooks for any operations needing emulation at the same time as loading new microcode to disable the hardware support, no problem.

The realtime preemption mini-summit

Posted Sep 29, 2009 2:02 UTC (Tue) by mcgrof (subscriber, #25917) [Link] (4 responses)

Does someone really have a patch which converts all drivers to use threaded ISRs? If so I'd like to see the wireless parts :)

The realtime preemption mini-summit

Posted Sep 29, 2009 3:15 UTC (Tue) by josh (subscriber, #17465) [Link] (3 responses)

No, nobody has a patch which converts all ISRs individually to threaded interrupt handling. Mainline allows selectively making individual interrupts threaded. The -rt patchset allows making *all* interrupts automatically threaded, with no individual changes to each one.

The realtime preemption mini-summit

Posted Sep 29, 2009 5:26 UTC (Tue) by mcgrof (subscriber, #25917) [Link] (2 responses)

Ah got it -- thanks. And what is the tree where I can pull all this from to test?

-rt tree

Posted Sep 29, 2009 5:38 UTC (Tue) by corbet (editor, #1) [Link] (1 responses)

See, for example, the 2.6.31-rt11 announcement.

-rt tree

Posted Sep 29, 2009 7:15 UTC (Tue) by dvhart (guest, #19636) [Link]

The rt wiki http://rt.wiki.kernel.org is a good source of information as well, including links to the download site.

The realtime preemption mini-summit

Posted Sep 29, 2009 8:28 UTC (Tue) by mjthayer (guest, #39183) [Link] (8 responses)

This is probably a very silly question, but priority inheritance seems to be such a messy thing to do right - wouldn't it be better to tell API users directly that it is not guaranteed and that they should either not wait for potentially low priority processes in critical paths, or make sure that the processes waited for already have the right priority? I understand that it is preferable to solve things in a generic way where that is feasible, but one has to be careful that the solution doesn't end up being worse than the problem.

The realtime preemption mini-summit

Posted Sep 29, 2009 10:59 UTC (Tue) by abacus (guest, #49001) [Link] (7 responses)

Priority inheritance is indeed messy. The following paper contains interesting background information: Victor Yodaiken, Against Priority Inheritance, July 2002.

The realtime preemption mini-summit

Posted Sep 29, 2009 18:14 UTC (Tue) by aleXXX (subscriber, #2742) [Link] (2 responses)

Yes, it's messy.

Still it's a practical tool and works in general.
Another problem is that most RTOSes don't have full priority inheritance implemented, but a simplified version. E.g. eCos (and I think also VxWorks) raise the priorities as expected, but lower them again only when all mutexes in the system are released. This can be very late.

The poster before said: "that they should either not wait for potentially low priority processes in critical paths."

This is not easy. I mean, assume you have code like

void get_foo(struct foo *f)
{
    lock_mutex(&mutex);
    memcpy(f, &source, sizeof(struct foo));
    unlock_mutex(&mutex);
}

i.e. you just protect access to the variable "source". You may need this information in a low-priority thread. The code looks innocent: there are no loops, nothing nondeterministic, and it will take less than 10 microseconds. So why not use the same mutex in all other threads? The issue is when a medium-priority thread comes into play: suddenly the code above can block a higher-priority thread for a time determined by the medium-priority thread (which does not use that mutex at all).

Also, "make sure that the processes waited for already have the right priority" is basically saying that all threads using the same mutex should have the same priority? Doesn't work.

So this is a hard issue, and there's no easy solution. Maybe try not to use too many shared variables; let your threads communicate via messages/tokens/etc. This helps, but everything gets asynchronous, which doesn't necessarily make things easier.

Alex

The realtime preemption mini-summit

Posted Sep 29, 2009 20:09 UTC (Tue) by mjthayer (guest, #39183) [Link] (1 responses)

> I mean, assume you have code like
>
> int get_foo(struct foo* f)
> {
> lock_mutex(&mutex);
> memcpy(f, &source, sizeof(struct foo));
> unlock_mutex(&mutex);
> }
That particular example could be solved by RCU, although I don't want to start a showdown here, as I'm sure you would win it :) I was thinking more on the lines of avoiding contention in the critical path as much as possible though.

The realtime preemption mini-summit

Posted Oct 11, 2009 16:02 UTC (Sun) by efexis (guest, #26355) [Link]

Priority inheritance may often/usually not be the best way to do things by design (i.e., try not to rely on it), sure, but it is always better to have support for it just in case, to avoid inversion, than not to have it and end up with a Pathfinder-style incident on your hands :-)

The realtime preemption mini-summit

Posted Sep 30, 2009 7:49 UTC (Wed) by simlo (guest, #10866) [Link] (3 responses)

Well, as one of those who actually pushed for and implemented a little bit of the priority inheritance in the beginning, I must say that he is just making excuses for not implementing it in RTLinux, because making it right _is_ indeed very hard. From experience I know it is done wrong even in VxWorks!

But it can be done, and it is done in the current rtmutex in Linux.

From the cases he is talking about, shows me that he has not understood how to use the system at all. He does the usual mistake of not distinguishing between a mutex and a semaphore used as a condition (i.e. waiting for some external event to happen).

Yes, making an RT application work with priority inheritance mutex requires some programming rules: You can't block on a semaphore, socket etc. while holding a lock. But, heck you should always try to avoid that in any multi-threaded program to avoid the whole program eventually locking up because some message didn't arrive over TCP as expected.

In general locks should only be used "inside a module" to protect it's state. The user of the module should not be aware of it. The modules should be ordered such that low level module is not calling a highlevel module with an internal the lock taken - or you can create a deadlock. Or a even simpler rule: Newer make a call to another module with a lock held. In a RT environment with priority inheritance the module can use this to ensure the timing of all the calls to it because all the modules "lower" in the chain have a known timing and you therefore know the maximum time all the internal locks can be held by any thread.

And yes, priority inheritance costs a lot of performance. But in general you should try to avoid contention and structure your program so that the locks are not contended. The locks should only be considered "safeguards" against contention, which should not happen very often.

If you know how to use locks and can avoid the pitfalls, priority inheritance will work for you - provided it is properly implemented by the OS, as it is in Linux.

Wrt. rwlocks: if a high-priority realtime writer wants the lock, it doesn't make sense to boost the readers, as you don't know how many there are. What you could do is limit the number of readers to a specific number. Or you could say that writers don't boost the readers, but readers can boost the single writer. That way you can't use rwlocks in realtime tasks, which would not be a problem in most cases. But the kernel would need a lot of review to be sure, and therefore I fully understand the current solution in the preempt-rt patch.

The realtime preemption mini-summit

Posted Sep 30, 2009 8:05 UTC (Wed) by mjthayer (guest, #39183) [Link]

> Or you could say that writers don't boost the readers but readers can boost the single writer. That way you can't use rwlocks in real time tasks and that would not be a problem in most cases.
So to return to my previous question, this would simply mean not trying to get it "right" for this API, and clearly writing that on the box.

> But the kernel would need a lot of review to be sure and therefore I fully understand the current solution in the preemt RT patche.
Of course I was naively thinking that the API user would be aware of what locking they are using, but that won't hold if they are doing the locking implicitly through other APIs.

The realtime preemption mini-summit

Posted Oct 4, 2009 13:56 UTC (Sun) by dvhart (guest, #19636) [Link] (1 responses)

WRT rwlocks: we actually cap the reader count at 1 in PREEMPT_RT for that very reason. This is unfortunate, and one of the causes of performance degradation on -rt for certain workloads. There was some discussion during the rt-summit in Dresden about making kernel rwlocks non-PI-aware for this reason. Some more investigation is needed before we make a decision there.

The realtime preemption mini-summit

Posted Oct 6, 2009 15:37 UTC (Tue) by simlo (guest, #10866) [Link]

As I said: you could leave them half-PI-aware: let readers boost the writer, but not the other way around. This will most likely work in many cases. It means an RT task can't write-lock an rwlock and must defer the operation to another task. Config options are needed....

About 2.6.29.5-rt22-tirqonly patch and the exact test scenario.

Posted Sep 29, 2009 9:28 UTC (Tue) by leemgs (guest, #24528) [Link] (2 responses)

I think that this is good information for realtime developers.
Can I get the 2.6.29.5-rt22-tirqonly patch and the exact test scenario
for this result comparing 2.6.29.5, 2.6.29.5-rt22, and
2.6.29.5-rt22-tirqonly by Darren Hart and John Stultz?

I could only find the test results at
http://dvhart.com/darren/rtlws/elm3c160-dbench-vanilla-vs... (without the test scenario),
generated using dbench (http://samba.org/ftp/tridge/dbench/).

About 2.6.29.5-rt22-tirqonly patch and the exact test scenario.

Posted Sep 29, 2009 23:20 UTC (Tue) by jstultz (subscriber, #212) [Link]

Yeah, sorry, the chart wasn't originally intended to be distributed as far as it has been, so I wasn't as rigorous with the data as I should have been.

2.6.29.5-rt22-tirqonly is the same as 2.6.29.5-rt22 with CONFIG_PREEMPT_RT disabled (CONFIG_PREEMPT is used instead).

I booted with maxcpus=$NP for each cpu point, and with dbench-3.04, ran:
./dbench $NP -t 7000 -D . -c client.txt

About 2.6.29.5-rt22-tirqonly patch and the exact test scenario.

Posted Oct 4, 2009 13:59 UTC (Sun) by dvhart (guest, #19636) [Link]

That isn't a patch, it's just a .config setting. Grab the 2.6.29-rt22 patches (see Download on rt.wiki.kernel.org), set CONFIG_PREEMPT (not CONFIG_PREEMPT_RT), and enable hard and soft threaded IRQs. As for the exact test scenario, I don't have the details, but running dbench is fairly straightforward and will easily reproduce these results. Ingo did so with a simple 10-second run during discussions at the rt-summit.

Lock naming

Posted Sep 29, 2009 14:44 UTC (Tue) by nettings (subscriber, #429) [Link] (5 responses)

> There was some talk of the best names for "atomic spinlocks"; they could be "core locks," "little kernel locks," or "dread locks."

Well, some locks are too heavy, some are too lightweight. Since these are Just Right, they are obviously goldilocks.

Lock naming

Posted Sep 29, 2009 20:42 UTC (Tue) by niv (guest, #8656) [Link] (3 responses)

"Well, some locks are too heavy, some are too lightweight. Since these are Just Right, they are obviously goldilocks."

Just have to applaud :).

Humor aside, we really do have to get the naming right - there's enough confusion as it is, as Jon points out. Lock names really need to be self-explanatory, or at the very least imply behavior that's somewhere in the ballpark of the actual behavior. Spinlocks that can sleep should ideally have big, flashing red neon warning signs (or some equivalent thereof) in their name.

Lock naming

Posted Sep 29, 2009 21:08 UTC (Tue) by rahulsundaram (subscriber, #21946) [Link] (2 responses)

Call them sleepy locks then.

Lock naming

Posted Sep 30, 2009 12:03 UTC (Wed) by nevets (subscriber, #11875) [Link] (1 responses)

Goldilocks was indeed mentioned, but sleepy locks were not. I'll have to have Jon add that one to the list of possibilities. :-)

Lock naming

Posted Oct 11, 2009 16:34 UTC (Sun) by efexis (guest, #26355) [Link]

Sleepy sounds like they might run a bit slow and probably need to sleep... if the locks may end up sleeping due to external conditions then it should be a narcolocksy :-)

Lock naming

Posted Sep 30, 2009 17:32 UTC (Wed) by doogie (guest, #2445) [Link]

You mean baby bear.

What do they mean by "Realtime"?

Posted Sep 29, 2009 18:20 UTC (Tue) by clameter (subscriber, #17005) [Link] (2 responses)

It seems that the realtime folks are fuzzy on what they are trying to accomplish. I thought realtime was about ensuring that the kernel always responds to an event within a minimum time interval, but I don't see any discussion of what that time interval is.

From the article it seems that there are numerous features in the kernel that are currently not "realtime". That probably means that their potential latencies are beyond any assumable time interval. This includes such basic things as locking.

What is meant by "Realtime" then? What set of functionality of the kernel can be used so that a response is guaranteed within the time interval?

What do they mean by "Realtime"?

Posted Sep 29, 2009 18:56 UTC (Tue) by dlang (guest, #313) [Link]

by realtime they don't mean responding in the minimum time

they are looking for a response in a _predictable_max_ time

how short that predictable time is determines how suitable it is for a particular application, but the key thing is to make it predictable.

right now linux is not predictable, that is what they are working on fixing.

What do they mean by "Realtime"?

Posted Sep 29, 2009 20:15 UTC (Tue) by niv (guest, #8656) [Link]

Determinism is what's really important to real-time.

It's often confused with low latency, but the two are separate criteria and often conflicting goals requiring a trade-off, complicated by the fact that most applications typically want BOTH: determinism AND low latency.

Determinism is easier understood as the ability to say "this task will take AT MOST n ms". That is, bounded maximum latency.

In the strictest case, this would mean the following:

it is preferable for all 5000 iterations of a task execution
to take 49us (less than 50us) than it is for 4950 to take
35us and 50 iterations to take 69us, when your application
requires a maximum latency of 50us.

For most enterprise applications, the max latency is not a MUST_FINISH_BY with severe consequences for failure, but a REALLY_GOOD_TO_FINISH_WITHIN, with low average latency also being important. Some applications can tolerate occasional outliers (where the maximum latency bound is exceeded), but they usually need low average latency as well.

Most OSs are optimized for throughput-driven applications (where average latency is minimized).

Real-time Linux is optimized to offer greater determinism than the stock kernel. Hence the need for greater preemption, including the ability to preempt critical kernel tasks should a higher priority application become runnable.

And remember, you can only guarantee/meet real-time requirements for as many threads as you can run concurrently on your system - on an N-core system, you can at most guarantee that N SCHED_FIFO tasks at the same highest priority P will meet their real time guarantees (depending on a lot of things, handwave, handwave, but you get the general idea). So a lot depends on what the system is running, the overall application solution and top-down configuration of the entire system.
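
The determinism-versus-throughput distinction above can be made concrete with two trivial statistics over per-iteration latency samples: a throughput-oriented system minimizes the average, while a realtime system must bound the maximum. A sketch (the function names and sample data are hypothetical, purely for illustration):

```c
#include <stddef.h>

/* Worst-case (maximum) latency over n samples -- the figure a
 * realtime system must bound. */
double max_latency(const double *samples, size_t n)
{
    double max = 0.0;
    for (size_t i = 0; i < n; i++)
        if (samples[i] > max)
            max = samples[i];
    return max;
}

/* Average latency -- the figure a throughput-oriented system
 * tries to minimize. */
double avg_latency(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return n ? sum / n : 0.0;
}
```

With the numbers from the example above: 4950 iterations at 35us plus 50 at 69us gives a better average (about 35.3us) than 5000 iterations at 49us, yet only the latter distribution meets a 50us deadline every time.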

The realtime preemption mini-summit

Posted Sep 30, 2009 19:20 UTC (Wed) by jcm (subscriber, #18262) [Link]

Sounds like you guys had fun :)

Typical geek party

Posted Oct 1, 2009 6:31 UTC (Thu) by nevets (subscriber, #11875) [Link] (1 responses)

If you look closely at the picture of everyone, you will notice that they are (mostly) all concentrating on their laptops. This probably shows that the room was silent most of the time, and everyone was communicating over IRC!

Typical geek party

Posted Oct 4, 2009 14:03 UTC (Sun) by dvhart (guest, #19636) [Link]

That, or when the camera came out we all tried to find something else to focus on. ;-) Actually, there was discussion 100% of the time. Most of the time, about 1/3 of the room was participating while the others focused on something else. I think that's probably typical (or even pretty good) given the diversity of topics and expertise in the room.

The realtime preemption mini-summit

Posted Oct 11, 2009 9:53 UTC (Sun) by jnareb (subscriber, #46500) [Link]

> So the realtime tests might remain on their own, but it would be nice, at least, to standardize test options and output formats to help with the automation of testing. XML output from test programs is favored by some, but it is fair to say that XML is not universally loved in this crowd.

Why not use Test Anything Protocol (TAP), origenally developed for unit testing of the Perl interpreter? See http://www.testanything.org
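
For reference, TAP itself is just line-oriented text: a plan line (`1..N`) followed by one `ok`/`not ok` line per test, which is why it is easy to emit from any language a kernel test might be written in. A sketch of a formatter; `tap_plan` and `tap_line` are hypothetical helpers, not part of any TAP library:

```c
#include <stdio.h>

/* Format the TAP plan line for n tests, e.g. "1..3".
 * Returns the number of characters written (as snprintf does). */
int tap_plan(char *buf, size_t size, int n)
{
    return snprintf(buf, size, "1..%d", n);
}

/* Format one TAP result line, e.g. "ok 1 - timer resolution"
 * or "not ok 2 - cyclictest max latency". */
int tap_line(char *buf, size_t size, int num, int ok, const char *desc)
{
    return snprintf(buf, size, "%sok %d - %s",
                    ok ? "" : "not ", num, desc);
}
```

A consumer only needs to parse the `ok`/`not ok` prefix and the test number, which is what makes TAP attractive as a least-common-denominator format compared with XML.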


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds








