The new pselect() system call

March 24, 2006

This article was contributed by Michael Kerrisk.

Applications like network servers that need to monitor multiple file descriptors using select(), poll(), or (on Linux) epoll_wait() sometimes face a problem: how to wait until either one of the file descriptors becomes ready, or a signal (say, SIGINT) is delivered. These system calls, as it turns out, do not interact entirely well with signals.

A seemingly obvious solution would be to write an empty handler for the signal, so that the signal delivery interrupts the select() call:

    static void handler(int sig) { /* do nothing */  }
    
    int main(int argc, char *argv[])
    {
        fd_set readfds;
        struct sigaction sa;
        int nfds, ready;
    
        sa.sa_handler = handler;     /* Establish signal handler */
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGINT, &sa, NULL);
        /* ... */
        ready = select(nfds, &readfds, NULL, NULL, NULL);
        /* ... */

After select() returns we can determine what happened by looking at the function result and errno. If errno comes back as EINTR, we know that the select() call was interrupted by a signal, and can act accordingly. But this solution suffers from a race condition: if the SIGINT signal is delivered after the call to sigaction(), but before the call to select(), it will fail to interrupt that select() call and will thus be lost.

We can try playing various games like setting a global flag within the signal handler and monitoring that flag in the main program, and using sigprocmask() to block the signal until just before the select() call. However, none of these techniques can entirely eliminate the race condition: there is always some interval, no matter how brief, where the signal could be handled before the select() call is started.

The traditional solution to this problem is the so-called self-pipe trick, often credited to D. J. Bernstein. Using this technique, a program establishes a signal handler that writes a byte to a specially created pipe whose read end is also monitored by the select() call. The self-pipe trick cleverly solves the problem of safely waiting either for a file descriptor to become ready or for a signal to be delivered. However, it requires a relatively large amount of code to meet a requirement that is essentially simple. (For example, a robust solution requires marking both the read and write ends of the pipe non-blocking.)

For this reason, the POSIX.1g committee devised an enhanced version of select(), called pselect(). The major difference between select() and pselect() is that the latter call has a signal mask (sigset_t) as an additional argument:

    int pselect(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, 
                const struct timespec *timeout, const sigset_t *sigmask);

The sigmask argument specifies a set of signals that should be blocked during the pselect() call; it overrides the current signal mask for the duration of that call. So, when we make the following call:

    ready = pselect(nfds, &readfds, &writefds, &exceptfds, 
                    timeout, &sigmask);

the kernel performs a sequence of steps that is equivalent to atomically performing the following system calls:

    sigset_t sigsaved;

    sigprocmask(SIG_SETMASK, &sigmask, &sigsaved);
    ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
    sigprocmask(SIG_SETMASK, &sigsaved, NULL);

For some time now, glibc has provided a library implementation of pselect() that actually uses the above sequence of system calls. The problem is that this implementation remains vulnerable to the very race condition that pselect() was designed to avoid, because the separate system calls are not executed as an atomic unit.

Using pselect(), we can safely wait for either a signal to be delivered or a file descriptor to become ready, by replacing the first part of our example program with the following code:

        sigset_t emptyset, blockset;

        sigemptyset(&blockset);         /* Block SIGINT */
        sigaddset(&blockset, SIGINT);
        sigprocmask(SIG_BLOCK, &blockset, NULL);

        sa.sa_handler = handler;        /* Establish signal handler */
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);
    
        /* Initialize nfds and readfds, and perhaps do other work here */
        /* Unblock signal, then wait for signal or ready file descriptor */

        sigemptyset(&emptyset);
        ready = pselect(nfds, &readfds, NULL, NULL, NULL, &emptyset);
        ...

This code works because the SIGINT signal is only unblocked once control has passed to the kernel. As a result, there is no point where the signal can be delivered before pselect() executes. If the signal is generated while pselect() is blocked, then, as with select(), the system call is interrupted, and the signal is delivered before the system call returns.

Although pselect() was conceived several years ago, and was already publicized in 1998 by W. Richard Stevens in his Unix Network Programming, vol. 1, 2nd ed., actual implementations have been slow to appear. Their eventual appearance in recent releases of various Unix implementations has been driven in part by the fact that the 2001 revision of the POSIX.1 standard requires a conforming system to support pselect(). With the 2.6.16 kernel release, and the required wrapper function that appears in the recently released glibc 2.4, pselect() also becomes available on Linux.

Linux 2.6.16 also includes a new (but nonstandard) ppoll() system call, which adds a signal mask argument to the traditional poll() interface:

    int ppoll(struct pollfd *fds, nfds_t nfds, const struct timespec *timeout,
              const sigset_t *sigmask);

This system call adds the same functionality to poll() that pselect() adds to select(). Not to be left in the cold, the epoll maintainer has patches in the pipeline to add similar functionality in the form of a new epoll_pwait() system call.
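
A call to ppoll() might look like the following sketch. The helper name ppoll_readable() is hypothetical. The sketch assumes glibc, which declares ppoll() only when _GNU_SOURCE is defined before any header is included, since the call is nonstandard.

```c
#define _GNU_SOURCE         /* Assumption: needed for glibc to declare ppoll() */
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* Wait until fd is readable, the timeout expires, or a signal that is
   unblocked in waitmask is caught. The return value is that of ppoll():
   a positive count if fd is ready, 0 on timeout, -1 on error (errno is
   EINTR if a handler interrupted the call). A NULL waitmask leaves the
   signal mask unchanged, and a NULL timeout means wait indefinitely. */
int ppoll_readable(int fd, const struct timespec *timeout,
                   const sigset_t *waitmask)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    return ppoll(&pfd, 1, timeout, waitmask);
}
```

As with pselect(), the intended pattern is to keep the signal blocked in normal operation and pass a mask in which it is unblocked.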

There are a few other, minor differences between pselect() and ppoll() and their traditional counterparts. For example, the timeout argument is a pointer to a timespec structure:

    struct timespec {
        time_t tv_sec;      /* Seconds */
        long   tv_nsec;     /* Nanoseconds */
    };

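This allows the timeout interval to be specified with nanosecond precision, which is finer than the microsecond precision of the timeval structure used by select() and the millisecond precision of the integer timeout used by poll().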
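
The difference in precision can be illustrated with a few conversion helpers. These are hypothetical functions written for this example, not part of any library, and they assume a non-negative interval given in nanoseconds. The coarser representations round up so that a wait is never shorter than requested.

```c
#include <sys/time.h>       /* struct timeval */
#include <time.h>           /* struct timespec */

/* Exact: the form taken by pselect() and ppoll() */
struct timespec ns_to_timespec(long long ns)
{
    struct timespec ts;

    ts.tv_sec = (time_t) (ns / 1000000000LL);
    ts.tv_nsec = (long) (ns % 1000000000LL);
    return ts;
}

/* Rounded up to the next microsecond: the form taken by select() */
struct timeval ns_to_timeval(long long ns)
{
    struct timeval tv;
    long long us = (ns + 999) / 1000;

    tv.tv_sec = (time_t) (us / 1000000);
    tv.tv_usec = (suseconds_t) (us % 1000000);
    return tv;
}

/* Rounded up to the next millisecond: the form taken by poll() */
int ns_to_poll_ms(long long ns)
{
    return (int) ((ns + 999999) / 1000000);
}
```

For an interval of 1,500,250 nanoseconds, the timespec holds the value exactly, the timeval holds 1501 microseconds, and the poll() timeout becomes 2 milliseconds.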

The glibc wrappers for pselect() and ppoll() also hide a couple of details of the underlying system calls.

First, the system calls actually expect the signal mask argument to be described by two arguments, one of which is a pointer to a sigset_t structure, while the other is an integer that indicates the size of that structure in bytes. This allows for the possibility of a larger sigset_t type in the future.

The underlying system calls also modify their timeout argument so that on an early return (because a file descriptor became ready, or a signal was delivered), the caller knows how much of the timeout remained. However, the respective wrapper functions hide this detail by making a local copy of the timeout argument and passing that copy to the underlying system calls. (The Linux select() system call also modifies its timeout argument, and this behavior is visible to applications. However, many other select() implementations don't modify this argument. POSIX.1 permits either behavior in a select() implementation.)
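
The visible difference between the two interfaces can be checked with a small experiment. The function names in this sketch are hypothetical. Each function sleeps for 20 milliseconds with no file descriptors and reports whether the timeout structure was changed by the call.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/* Returns 1 if select() modified its timeout argument, 0 if it did not,
   -1 on error. On Linux the expected result is 1; other systems may
   leave the structure untouched, which POSIX.1 also permits. */
int select_changes_timeout(void)
{
    struct timeval tv;

    tv.tv_sec = 0;
    tv.tv_usec = 20000;         /* 20 milliseconds */
    if (select(0, NULL, NULL, NULL, &tv) == -1)
        return -1;
    return tv.tv_sec != 0 || tv.tv_usec != 20000;
}

/* Returns 1 if pselect() modified its timeout argument, 0 if it did not,
   -1 on error. The argument is const in the pselect() prototype, so the
   wrapper is expected to leave the caller's structure alone. */
int pselect_changes_timeout(void)
{
    struct timespec ts;

    ts.tv_sec = 0;
    ts.tv_nsec = 20000000L;     /* 20 milliseconds */
    if (pselect(0, NULL, NULL, NULL, &ts, NULL) == -1)
        return -1;
    return ts.tv_sec != 0 || ts.tv_nsec != 20000000L;
}
```

Portable code that calls select() in a loop should therefore reinitialize the timeval structure before each call.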

Further details of pselect() and ppoll() can be found in the latest versions of the select(2) and poll(2) man pages.

Index entries for this article

GuestArticles Kerrisk, Michael


The new pselect() system call

Posted Mar 30, 2006 8:31 UTC (Thu) by wingo (guest, #26929) [Link]

Nice article, thanks.

The new pselect() system call

Posted Mar 30, 2006 11:20 UTC (Thu) by nix (subscriber, #2304) [Link]

Now all we need is working userspace kernel headers for 2.6.16, so the new glibc can actually get at these new syscalls... :/

The new pselect() system call

Posted Mar 30, 2006 15:39 UTC (Thu) by clugstj (subscriber, #4020) [Link] (8 responses)

I don't understand why people want to use signals in their programs. They are a race condition nightmare to manage in combination with threads. I have always (20 years of experience) been able to rewrite programs to avoid signal usage except in the case of fatal error conditions.

The new pselect() system call

Posted Mar 30, 2006 19:05 UTC (Thu) by smoogen (subscriber, #97) [Link] (7 responses)

An example of how this is done might illuminate people who still use it.

The new pselect() system call

Posted Mar 30, 2006 19:21 UTC (Thu) by clugstj (subscriber, #4020) [Link] (5 responses)

OK, but first someone must explain to me what the signal is being used for. Then I will try to come up with a design that doesn't require a signal to perform this function. Usually, in the past, what I've seen is signals being used for timeouts. This is easily avoided by using select/poll (which have timeout parameters) before calling read/write instead of blocking in read/write.

The new pselect() system call

Posted Mar 30, 2006 19:58 UTC (Thu) by mikov (guest, #33179) [Link]

I completely agree. Signals should be avoided if possible - there are countless race conditions and apparently even buggy glibc functions (!) associated with them. To me signals have always seemed like a remnant from the past when they were used to supplement a lacking API. If anything the API should evolve to decrease the need for signals even further.

For comparison the Win32 API doesn't support signals.

The new pselect() system call

Posted Mar 31, 2006 4:24 UTC (Fri) by hppnq (guest, #14462) [Link] (2 responses)

Mmmh, I like the fact that system calls can be interrupted. :-)

Signals can be used for notifying process groups asynchronously of some event, for instance, which is likely to be a bit more tricky to do with select(). Yes, this is ancient Unix, but so is the concept of the tty.

Using select() as a replacement for signal handling is a quite tricky hack that only solves one specific problem and creates a couple of others, so I'm glad work is being done to implement the appropriate, standard interface. :-)

System Call Interruption

Posted Mar 31, 2006 13:30 UTC (Fri) by clugstj (subscriber, #4020) [Link] (1 responses)

System call interruption is not portable. Some UNIXes will restart some system calls after they handle the signal, some won't. Yes, signals are OK in non-threaded applications, but when the application uses threads, the race-condition nightmare begins.

System Call Interruption

Posted Mar 31, 2006 15:23 UTC (Fri) by hppnq (guest, #14462) [Link]

Portability is not an issue when your system call is actually interrupted. :-)

The complexity that arises from mixing signals and threads is only justified if you have specific reasons to implement it like that. By default, a multi-threaded process acts the same as a non-threaded process when interrupted.

So, unless you have specific reasons for defining per-thread signal masks (which is possible), there's nothing special about the multi-threaded case.

The new pselect() system call

Posted Mar 31, 2006 20:16 UTC (Fri) by giraffedata (guest, #1954) [Link]

How about the most basic use of signals: I have a server that waits for input on various sockets and I want to terminate the server. Sending SIGTERM is a conventional way to do that. I know it well, I know the tools ('kill') to do it like the back of my hand, and it works the same on most other processes, so it is very convenient for it to work on the server in question.

If the server is simple enough, the author doesn't have to write any code at all; the kernel will terminate it all by itself. If the server wants to e.g. finish up transactions in progress, the author can add a small amount of code to handle SIGTERM. But without signals, the author must write more code and design a special termination protocol. I then must remember to use it, and probably look up how to use the tool that initiates it, every time I want to terminate the server.

There are similar external interruption sort of things where signals are easier to implement and easier for the user to deal with. Think about a program in which a user types commands. One of those commands is taking too long and he wants to abort and go back to the program's command prompt. Ctl-C is an excellent way for him to interrupt, and rather easy to implement (Ctl-C normally generates a signal).

Signals also cut through layers. Let's say I run a program that calls a library function that calls a library function and so on and 5 levels deep there is a select(). I don't know or want to know anything about those deep libraries, but I know I don't want to wait more than 4 seconds for everything to finish. With a signal, I can get that select() to abort after 4 seconds without even knowing it's there. The only other way would be to add timeouts in every parameter list in the stack, and have every level keep track of elapsed time. More work for everybody.

I think of signals, when properly used with the modern interfaces, to solve the same problems that interrupts and "throwing" objects do.

Timeout, on the other hand, is what is misplaced on the select() call. Timing out should be handled either through a signal (it would have to be different from the existing global signals, so that you could use it inside a library) or a file descriptor that becomes ready at a certain time (not unlike the self-sending pipe thing that substitutes for signals).

The new pselect() system call

Posted Apr 7, 2006 13:39 UTC (Fri) by lgb (guest, #784) [Link]

I ran into this problem when I wrote my little server, which serves incoming requests by spawning special programs with fork() and then exec(). The "main" process handles accept() and initializes communication between my server and the connecting clients, and when enough data has been sent by a client, it starts an external program and manages the data flow between it and my server. Of course I have to keep track of the processes I've forked, so SIGCHLD is needed. And I don't want to create a new process for each incoming connection, nor threads (I want my software to run on both Solaris and Linux, and I've learned that forking from a multithreaded application is not a very fast or portable way to do things), so I've decided to use nonblocking I/O and select(). Also, because of the need to take care of my children, signal handling is a must for me.

The new pselect() system call

Posted Mar 30, 2006 16:35 UTC (Thu) by mtk77 (guest, #6040) [Link] (6 responses)

The traditional way to deal with this is for the process to create a pipe and the select call to include that in its list. Then, the signal handler just writes a byte down the pipe.

My little inetd at http://hairy.beasts.org/whinetd/whinetd/src/whinetd.c uses this technique.

Self-pipe trick (and failings)

Posted Mar 30, 2006 21:03 UTC (Thu) by jreiser (subscriber, #11027) [Link] (5 responses)

The fourth paragraph of the article begins, "The traditional solution to this problem is the so-called self-pipe trick ...". That paragraph alleges non-obvious defects in such a trick. What is your response?

Self-pipe trick (and failings)

Posted Mar 30, 2006 22:13 UTC (Thu) by mkerrisk (subscriber, #1978) [Link] (4 responses)

John,

> The fourth paragraph of the article begins, "The traditional solution to this problem is the so-called self-pipe trick ...". That paragraph alleges non-obvious defects in such a trick. What is your response?

I'm not sure if you are asking this question of me (author of article) or "mtk77" (who your article seems to reply to). Anyway, I don't allege any defects in the self-pipe trick; all I say is that it requires quite a bit of code to do things right:

1. We create a pipe, and use fcntl(F_SETFL) to mark both ends non-blocking (O_NONBLOCK).
2. We include the read end of the pipe in the readfds set given to select().
3. When the signal handler is called, it writes a byte to the pipe. By making the write end of the pipe non-blocking, we avoid the possibility of blocking in the signal handler if signals are delivered so rapidly that the pipe fills up. (If we fail to write a byte to the pipe because it is full, that doesn't matter: the presence of the existing bytes already informs us that the signal has been delivered.)
4. We place the select() call in a loop and restart it if interrupted by a signal:

    while ((ready = select(nfds, &readfds, NULL, NULL, NULL)) == -1 &
           errno == EINTR)
        continue;

5. After the select() we can determine if the signal arrived by checking whether the read end of the pipe is among the descriptors returned in readfds.
6. We read all of the bytes from the pipe (so that we can know when new signals occur), and take whatever action is needed in response to the signal.

It all works fine, but pselect() allows us to achieve the same result with less code.

Self-pipe trick (and failings)

Posted Mar 31, 2006 1:21 UTC (Fri) by dougm (guest, #4615) [Link] (1 responses)

Just a nit: I think you want '&&' in the while condition, not '&'.

Self-pipe trick (and failings)

Posted Mar 31, 2006 3:41 UTC (Fri) by mkerrisk (subscriber, #1978) [Link]

> Just a nit: I think you want '&&' in the while condition, not '&'.

Doh! Thanks, yes.

Self-pipe trick (and failings)

Posted Mar 31, 2006 13:43 UTC (Fri) by clugstj (subscriber, #4020) [Link] (1 responses)

Yes, it takes some code, but it only needs to be written and debugged once. Then tuck it into your comm library behind a simple interface and forget about the gory details. This is a problem that can be solved completely in user space. Adding a new system call to handle it is adding complexity to the kernel for no good reason.

It is not a better solution just because it takes less code (in user space).

Self-pipe trick (and failings)

Posted Apr 6, 2006 10:09 UTC (Thu) by renox (guest, #23785) [Link]

> Yes, it takes some code, but it only needs to be written and debugged once.

Well, does glibc include it?
No, because the self-pipe trick cannot really be put into a library: it plays tricks with signals and pipes which could have an impact on your code, so each userspace program which needs it must reimplement it. I'd hardly call this 'once'.

Whereas the kernel implementation is really unique, so it is really better.
Plus it conforms to the POSIX standard, even better!

That said, I find it quite awful that glibc would implement pselect leaving the unsuspecting developer vulnerable to the race condition; it should be either implemented in the kernel or not at all (unless the library finds a way to close the race condition, of course).

The race still occurs :-(

Posted Jul 21, 2008 14:23 UTC (Mon) by almorozov (guest, #53014) [Link]

Hi
I have just encountered the race condition which the provided recipe tries to avoid :-(. The code looks similar to one in the article:

1. SIGCHLD is blocked
2. Then two child processes are invoked (in a "pipe" i.e. output of one process is the input of another).
3. A SIGCHLD handler similar to one described in info '(libc)Merged Signals' is installed
4. So far so good. Then I start to wait in pselect

Logs show that the first child exits, its termination signal is caught by the handler and pselect returns with exitcode=-1, errno=EINTR. Ok.

The code performs all necessary actions and returns back to pselect(). But when the second child exits there's a *possibility* that pselect won't return after signal is caught and processed in the handler.

I put a simple check that right before pselect() SIGCHLD is blocked (and it is). Logs show that the handler is invoked (that is the signal was unblocked in pselect) but strace'ing the process shows that it's hanging on select() :-(. It seems I missed some magic :-(.

Debian-4.0 under VZ-enabled 2.6.18 kernel.

Any equivalent to epoll_pwait in kqueue

Posted Nov 22, 2024 13:15 UTC (Fri) by TuhinK (guest, #174722) [Link]

Is it possible to prevent the mentioned race condition in kqueue? Thank you.
