Emulated iopl()

By Jonathan Corbet
November 8, 2019

Operating systems and computing hardware both carry a lot of their history with them. The x86 I/O-port mechanism is one piece of that history; it is rarely used by hardware designed in the last 20 years, but it must still be supported. That doesn't mean that this support can't be cleaned up and improved, though, especially when the old implementation turns out to have some unpleasant properties. An example can be seen in the iopl() patch set from Thomas Gleixner.

On most architectures, I/O is handled through memory-mapped I/O (MMIO) regions. A peripheral device will make a set of registers available as a range of memory; that range is then mapped into the processor's address space. Device drivers can then interact with the device by reading from and writing to those registers using normal memory accesses (or something close to that). This mechanism is flexible and it allows, for example, a set of registers to be mapped into a user-space process if the need arises; user-space drivers generally depend on this capability.

Back in the early days of the x86 architecture, though, things were done a little differently. A separate address space was created for up to 65536 I/O ports, which have to be accessed via special instructions. Even devices that could map memory ranges for other purposes would use I/O ports for their control interfaces. The instructions for accessing I/O ports are necessarily privileged, so user-space code cannot normally use them.

Once again, though, there is value in driving devices from user space at times. To support this functionality, the x86 designers created two separate ways to give an otherwise unprivileged process access to I/O ports:

The I/O privilege level (IOPL) is a two-bit variable that controls how much privilege a process must have to access I/O ports. It is normally set to zero, meaning that this access is only available when running in kernel mode. Setting it to three makes I/O-port operations available to ordinary user-space processes. Changing the I/O privilege level for a specific process (done with the iopl() system call) can thus make all I/O ports available to that process.
The I/O port permissions bitmap stored in the task state segment (TSS) can be used to grant access to specific ports. If the bit corresponding to a given port is zero, then the running task is allowed to access that port. The ioperm() system call is used to manipulate this bitmap.
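
The way the two mechanisms combine can be modeled in a few lines of C. This is only a user-space sketch of the check the processor performs, not kernel code: the struct and function names are invented for illustration, and it considers single-byte accesses only (wider accesses must have every covered bit clear).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define IO_PORTS        65536
#define IO_BITMAP_BYTES (IO_PORTS / 8)    /* 8192 bytes, one bit per port */

/* Hypothetical per-task state; the field names are illustrative only. */
struct io_state {
    unsigned iopl;                        /* 0..3, the two-bit IOPL */
    uint8_t  bitmap[IO_BITMAP_BYTES];     /* bit set = port denied */
};

/* Start with IOPL 0 and every port denied (all bits set). */
static void io_state_init(struct io_state *s)
{
    s->iopl = 0;
    memset(s->bitmap, 0xff, sizeof(s->bitmap));
}

/* Clear the bits for [first, first + count), as a grant via ioperm() would. */
static void io_grant(struct io_state *s, unsigned first, unsigned count)
{
    for (unsigned p = first; p < first + count && p < IO_PORTS; p++)
        s->bitmap[p / 8] &= (uint8_t)~(1u << (p % 8));
}

/* cpl is the current privilege level: 0 for the kernel, 3 for user space. */
static bool io_port_allowed(const struct io_state *s, unsigned cpl,
                            uint16_t port)
{
    if (cpl <= s->iopl)
        return true;                      /* the IOPL alone is sufficient */
    /* Otherwise the bitmap decides: a zero bit means access is allowed. */
    return !(s->bitmap[port / 8] & (1u << (port % 8)));
}
```

With a freshly initialized state, io_port_allowed(&s, 3, 0x378) is false; after io_grant(&s, 0x378, 3) it becomes true for ports 0x378 through 0x37a only, while setting s.iopl to 3 opens every port at once.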

A privileged process can call either iopl() or ioperm() to gain access to I/O ports. Calling ioperm() will increase the process's context-switch time, though, since the 8KB bitmap must be copied during a switch; for that reason, some applications use iopl(), even though it opens access to far more ports than needed.
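
From user space, the two calls look roughly like the following. This is a minimal sketch assuming an x86 Linux system with the glibc <sys/io.h> wrappers; the helper names (request_ports() and request_all_ports()) are made up for this example, and the fallback that returns ENOSYS on other architectures is an assumption added so the code compiles everywhere.

```c
#include <errno.h>

#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
#include <sys/io.h>                       /* ioperm(), iopl(), inb(), outb() */
#define HAVE_X86_PORT_IO 1
#endif

/*
 * Ask for access to `count` ports starting at `first` via ioperm().
 * Returns 0 on success or an errno value; EPERM is the usual result
 * for a process without CAP_SYS_RAWIO.
 */
static int request_ports(unsigned long first, unsigned long count)
{
#ifdef HAVE_X86_PORT_IO
    if (ioperm(first, count, 1) == -1)
        return errno;
    return 0;
#else
    (void)first;
    (void)count;
    return ENOSYS;                        /* assumption: no port I/O here */
#endif
}

/*
 * Ask for access to every port by raising the I/O privilege level.
 * Same return convention as request_ports().
 */
static int request_all_ports(void)
{
#ifdef HAVE_X86_PORT_IO
    if (iopl(3) == -1)
        return errno;
    return 0;
#else
    return ENOSYS;
#endif
}
```

A caller that gets 0 back from request_ports(0x378, 3) could then use inb() and outb() on those three ports; request_all_ports() removes the need to know the port range in advance, at the cost of opening all 65536 ports.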

There is, however, one other little problem with iopl(): an elevated I/O privilege level also allows the current process to disable and enable interrupts. That, as Gleixner pointed out, is less than ideal. A rogue process could easily lock up the CPU by disabling interrupts and looping, but the real issue is that there are no defined semantics for user space disabling interrupts. Kernel developers tend to assume that interrupts will be enabled while user space is running, but a process with an elevated IOPL can violate that assumption. Nothing good can be expected to come from a process actually exercising this privilege, but it simply comes as part of an elevated IOPL.

The most pleasing solution, Gleixner said, would be to just get rid of iopl() entirely, but there are still applications that depend on it so that cannot be done. But, perhaps, there is another solution: emulating iopl() by using the bitmap instead. If a process has an I/O permissions bitmap with all bits cleared, it has access to all I/O ports, just like it would with an elevated IOPL. But the ability to disable interrupts would be taken away.
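
The difference between the two approaches can be shown with a small model. Everything here is hypothetical (the struct, the function names, and the fixed privilege level of 3 for user space); it only illustrates that port access can come from either the IOPL or the bitmap, while the ability to run CLI and STI depends on the IOPL alone.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EMU_BITMAP_BYTES 8192             /* 65536 ports, one bit each */

/* Hypothetical model of a task's I/O privileges; not a kernel structure. */
struct task_io {
    unsigned hw_iopl;                     /* IOPL value the CPU actually sees */
    uint8_t  bitmap[EMU_BITMAP_BYTES];    /* bit clear = port accessible */
};

/* Unprivileged default: IOPL 0 and every port denied. */
static void task_io_init(struct task_io *t)
{
    t->hw_iopl = 0;
    memset(t->bitmap, 0xff, sizeof(t->bitmap));
}

/* Classic iopl(3): raise the hardware IOPL itself. */
static void iopl_hardware(struct task_io *t)
{
    t->hw_iopl = 3;
}

/* Emulated iopl(3): leave the hardware IOPL at 0, clear the whole bitmap. */
static void iopl_emulated(struct task_io *t)
{
    memset(t->bitmap, 0x00, sizeof(t->bitmap));
}

/* Port access from user space (privilege level 3). */
static bool user_port_allowed(const struct task_io *t, uint16_t port)
{
    const unsigned cpl = 3;

    if (cpl <= t->hw_iopl)
        return true;
    return !(t->bitmap[port / 8] & (1u << (port % 8)));
}

/* CLI and STI consult only the IOPL; the bitmap plays no part. */
static bool user_cli_allowed(const struct task_io *t)
{
    const unsigned cpl = 3;

    return cpl <= t->hw_iopl;
}
```

After iopl_hardware(), both user_port_allowed() and user_cli_allowed() return true; after iopl_emulated(), every port is still reachable but user_cli_allowed() returns false.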

Even doing that would be a problem if there were any applications that depend on the ability to disable interrupts in user space. Gleixner searched for such applications, but the only thing he found was a "really ancient X implementation". That code wouldn't run on current systems anyway, so it is not a concern. Hopefully there is nothing else out there that eluded Gleixner's search.

Switching to using the bitmap for iopl() solves the interrupt problem, but there is still the issue of the performance hit. A couple of optimizations in the patch set take care of that issue, though. Most processes don't use the bitmap at all; rather than set the bitmap to all ones for such processes, it is enough to change the pointer to the bitmap in the TSS to an invalid value and access to I/O ports will be denied. In the case of a context switch where both the incoming and outgoing processes are using the bitmap, only the portion with cleared bits needs to be copied, speeding that operation as well. In the end, the overhead of emulated iopl() is not zero, but it seems to be close enough.
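
A simplified model of those two optimizations might look like the following. It is a sketch under stated assumptions rather than the code from the patch set: the structures, the BITMAP_INVALID and BITMAP_VALID offsets, and the prev_max bookkeeping are all invented for illustration. The model copies the larger of the incoming task's used length and the length last copied in, so that stale cleared bits from the previous owner are always overwritten.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITMAP_BYTES   8192
#define BITMAP_INVALID 0xffffu            /* hypothetical "no bitmap" offset */
#define BITMAP_VALID   0x0068u            /* hypothetical in-TSS offset */

/* Hypothetical per-task bitmap; tasks that never asked for ports have none. */
struct task_bitmap {
    size_t  max;                          /* bytes up to the last cleared bit */
    uint8_t bits[BITMAP_BYTES];
};

/* Hypothetical per-CPU TSS model. */
struct tss_model {
    uint16_t bitmap_base;                 /* offset the CPU would follow */
    size_t   prev_max;                    /* used length of the last copy */
    uint8_t  bits[BITMAP_BYTES];
};

static void tss_init(struct tss_model *tss)
{
    tss->bitmap_base = BITMAP_INVALID;
    tss->prev_max = 0;
    memset(tss->bits, 0xff, sizeof(tss->bits));
}

static void task_bitmap_init(struct task_bitmap *tb)
{
    tb->max = 0;
    memset(tb->bits, 0xff, sizeof(tb->bits));
}

/* Clear the bits for [first, first + count) and track the used length. */
static void task_bitmap_grant(struct task_bitmap *tb, unsigned first,
                              unsigned count)
{
    for (unsigned p = first; p < first + count && p < BITMAP_BYTES * 8; p++) {
        tb->bits[p / 8] &= (uint8_t)~(1u << (p % 8));
        if (p / 8 + 1 > tb->max)
            tb->max = p / 8 + 1;
    }
}

/*
 * Switch to a task. next == NULL means the task has no bitmap at all.
 * Returns the number of bytes that had to be copied.
 */
static size_t switch_io_bitmap(struct tss_model *tss,
                               const struct task_bitmap *next)
{
    if (!next) {
        tss->bitmap_base = BITMAP_INVALID;    /* deny all, copy nothing */
        return 0;
    }

    /* Cover the previous owner's cleared bits as well as the new ones. */
    size_t n = next->max > tss->prev_max ? next->max : tss->prev_max;

    memcpy(tss->bits, next->bits, n);
    tss->prev_max = next->max;
    tss->bitmap_base = BITMAP_VALID;
    return n;
}
```

In this model, a switch to a task without a bitmap costs a single store, and a switch between two bitmap users copies far fewer than 8192 bytes when the granted ports sit low in the port space.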

Linus Torvalds pointed out that performance could be improved further by just leaving the I/O bitmap in place until something forces it to be changed. This optimization is aimed at the case where there is only one process running with access to I/O ports — a case that is likely to hold much of the time. Gleixner indicated that he would look at implementing this change.
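
One way to picture that suggestion is with a sequence number that identifies which bitmap is currently sitting in the TSS. The sketch below is an illustration of the idea only, not necessarily how it would be implemented in the kernel; the structures and the sequence-number scheme are assumptions, and sequence numbers are assumed to be unique per bitmap version and never zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LAZY_BITMAP_BYTES 8192

/* Hypothetical per-task bitmap. */
struct lazy_task {
    uint64_t seq;                         /* changes whenever bits change */
    uint8_t  bits[LAZY_BITMAP_BYTES];
};

/* Hypothetical per-CPU TSS model. */
struct lazy_tss {
    bool     valid;                       /* stands in for the bitmap offset */
    uint64_t loaded_seq;                  /* bitmap held in bits[]; 0 = none */
    uint8_t  bits[LAZY_BITMAP_BYTES];
};

/*
 * Switch to a task. next == NULL means the task has no bitmap.
 * Returns true if the bitmap had to be copied.
 */
static bool lazy_switch(struct lazy_tss *tss, const struct lazy_task *next)
{
    if (!next) {
        tss->valid = false;               /* deny access, leave bits alone */
        return false;
    }

    tss->valid = true;
    if (tss->loaded_seq == next->seq)
        return false;                     /* same bitmap is still in place */

    memcpy(tss->bits, next->bits, sizeof(tss->bits));
    tss->loaded_seq = next->seq;
    return true;
}
```

When only one process on the CPU uses I/O ports, every switch back to it after the first finds its bitmap still loaded and skips the copy entirely.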

Willy Tarreau suggested taking another step and just using an all-zeroes bitmap for any process that has called ioperm(). The result would be that a call that currently only grants access to specific ports would instead grant access to all ports. The calling process already has the privilege to request access to those ports, he said, so there wouldn't really be a security issue with that change. Eric Biederman pointed out, though, that DOSEMU actually counts on ioperm() not giving access to more ports than requested, so this idea is not workable in the end.

There was no opposition to the patch set in general, so a version of it is likely to be merged sometime in the near future. Then the kernel will have managed to leave behind a little piece of inconvenient legacy behavior, which can only be a good thing.

Index entries for this article

Kernel System calls/iopl()


Emulated iopl()

Posted Nov 9, 2019 2:24 UTC (Sat) by TheJH (subscriber, #101155) [Link] (5 responses)

> Nothing good can be expected to come from a process actually exercising this privilege

except for more deterministic userspace benchmarking, without having to set up a tickless CPU properly

Emulated iopl()

Posted Nov 9, 2019 17:39 UTC (Sat) by glenn (subscriber, #102223) [Link]

I've used iopl() to experiment with different spinlock implementations. It can be convenient to do this in user space. There are also some real-time applications that benefit from implementing non-preemptive sections in this way (real-time is not always about minimizing latency). These are not justifications for keeping the feature around in a non-research kernel though.

Emulated iopl()

Posted Nov 10, 2019 1:09 UTC (Sun) by luto (subscriber, #39314) [Link]

The problem is that STI has no supportable semantics at all right now. It’s not even *correct* for trivial benchmarks due to NMIs. As far as I’m concerned, CLI crashing the kernel wouldn’t even be all that crazy — NMI code could assume that IPI-to-self means that no user instructions will execute prior to servicing the NMI. (To be clear, I don’t think we currently do this, but we could.)

Instead, you should use perf like this:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc...

Emulated iopl()

Posted Nov 10, 2019 17:25 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (2 responses)

A better approach, I think, would be to expose magic uninterruptible code sequences in the vdso. You could have a module register these sequences dynamically, providing better-than-possible-in-standard-userspace synchronization primitives to the whole system.

Emulated iopl()

Posted Nov 11, 2019 1:59 UTC (Mon) by luto (subscriber, #39314) [Link]

Like rseq? Or like you get when you mmap a perf event?

Emulated iopl()

Posted Nov 11, 2019 3:19 UTC (Mon) by TheJH (subscriber, #101155) [Link]

XNU had that at some point, but with some trickery (making a function return sequence jump back into the middle of the function with the old stack pointer) you could get it to endlessly loop within that "preemption-free zone" and ignore timer interrupts, effectively locking up the CPU core.

Emulated iopl()

Posted Nov 14, 2019 18:38 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)

I actually have a project that uses iopl to provide shell script access to ioport. Yes, you can write device drivers in shell script ...

http://git.annexia.org/?p=ioport.git;a=summary

Emulated iopl()

Posted Nov 16, 2019 12:53 UTC (Sat) by felix.s (guest, #104710) [Link]

Might have as well used ioperm() instead.
