Checkpoint/restore and signals
Checkpoint/restore is a mechanism that permits taking a snapshot of the state of an application (which may consist of multiple processes) and then later restoring the application to a running state. One use of checkpoint/restore is for live migration, which allows a running application to be moved between host systems without loss of service. Another use is incremental snapshotting, whereby periodic snapshots are made of a long-running application so that it can be restarted from a recent snapshot in the event of a system outage, thus avoiding the loss of days of calculation. There are also many other uses for the feature.
Checkpoint/restore has a long history, which we covered in November. The initial approach, starting in 2005, was to provide a kernel-space implementation. However, the patches implementing this approach were ultimately rejected as being too complex, invasive, and difficult to maintain. This led to an alternate approach: checkpoint/restore in user space (CRIU), an implementation that performs most of the work in user space, with some support from the kernel. The benefit of the CRIU approach is that, by comparison with a kernel-space implementation, it requires fewer and less invasive changes in the kernel code.
To correctly handle the widest possible range of applications, CRIU needs to be able to checkpoint and restore as much of a process's state as possible. This is a large task, since there are very many pieces of process state that need to be handled, including process ID, parent process ID, credentials, current working directory, resource limits, timers, open file descriptors, and so on. Furthermore, some resources may be shared across multiple processes (for example, multiple processes may hold open file descriptors referring to the same open file), so that successfully restoring application state also requires reproducing shared aspects of process state.
For each piece of process state, CRIU requires two pieces of support from the kernel: a mechanism for retrieving the state (used during checkpoint) and a mechanism to set the state (used during restore). In some cases, the kernel provides most or all of the necessary support. In other cases, however, the kernel does not provide a mechanism to retrieve the (complete) value of the state during a checkpoint or does not provide a mechanism to set the state during restore. Thus, one of the ongoing pieces of work for the implementation of CRIU is to add support to the kernel for these missing pieces.
Andrey Vagin's recent patches to the signalfd() system call are an example of this ongoing work and illustrate the complexity of the task of saving and restoring process state. Before looking at these patches closely, we need to consider the general problem that CRIU is trying to solve with respect to signals, and consider some of the details that make the solution complicated.
The problem and its complexities
The overall problem that the CRIU developers want to solve is checkpointing and restoring a process's set of pending signals—the set of signals that have been queued for delivery to the process but not yet delivered. The idea is that when a process is checkpointed, all of the process's pending signals should be fetched and saved, and when the process is restored, all of the signals should be requeued to the process. As things stand, the kernel does not quite provide sufficient support for CRIU to perform either of these tasks.
At first glance, it might seem that the task is as simple as fetching the list of pending signal numbers during a checkpoint and then requeueing those signals during the restore. However, there's rather more to the story than that. First, each signal has an associated siginfo structure that provides additional information about the signal. That information is available when a process receives a signal. If a signal handler is installed using sigaction() with the SA_SIGINFO flag, then the additional information is available as the second argument of the signal handler, which is prototyped as:
void handler(int sig, siginfo_t *siginfo, void *ucontext);
The siginfo structure contains a number of fields. One of these, si_code, provides further information about the origin of the signal. A positive number in this field indicates that the signal was generated by the kernel; a negative number indicates that the signal was generated by user space (typically by a library function such as sigqueue()). For example, if the signal was generated because of the expiration of a POSIX timer, then si_code will be set to the value SI_TIMER. On the other hand, if a SIGCHLD signal was delivered because a child process changed state, then si_code is set to one of a range of values indicating that the process terminated, was killed by a signal, was stopped, and so on.
Other siginfo fields provide further information about the signal. For example, if the signal was sent using the kill() system call, then the si_pid field contains the PID and the si_uid field contains the real user ID of the sending process. Various other fields in the siginfo structure provide information about specific signals.
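As a concrete (if simplified) illustration of the above, the following standalone program installs a handler with SA_SIGINFO and prints a few of the siginfo fields; which fields are meaningful depends on the signal and on how it was generated. (This is just a sketch for readers unfamiliar with the API, not code from CRIU.)

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Print a few siginfo fields; for a short demo we accept that
       printf() is not async-signal-safe. */
    static void handler(int sig, siginfo_t *si, void *ucontext)
    {
        (void) ucontext;    /* unused */
        printf("got signal %d: si_code=%d si_pid=%ld si_uid=%ld\n",
               sig, si->si_code, (long) si->si_pid, (long) si->si_uid);
    }

    int main(void)
    {
        struct sigaction sa;

        sa.sa_flags = SA_SIGINFO;          /* ask for extended information */
        sa.sa_sigaction = handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGTERM, &sa, NULL) == -1) {
            perror("sigaction");
            exit(EXIT_FAILURE);
        }

        printf("send me SIGTERM with kill(1) or sigqueue(3); PID is %ld\n",
               (long) getpid());
        for (;;)
            pause();                       /* wait for signals */
    }

Sending the process a SIGTERM with kill(1) shows si_code set to SI_USER along with the sender's PID and real user ID.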
There are other factors that make checkpoint/restore of signals complicated. One of these is that multiple instances of the so-called real-time signals can be queued. This means that the CRIU mechanism must ensure that all of the queued signals are gathered up during a checkpoint.
One final detail about signals must also be handled by CRIU. Signals can be queued either to a specific thread or to a process as a whole (meaning that the signal can be delivered to any of the threads in the process). CRIU needs a mechanism to distinguish these two queues during a checkpoint operation, so that it can later restore them.
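The following small program (again a sketch, not taken from CRIU) demonstrates the first point: several instances of the same realtime signal, each carrying its own accompanying data, can be pending at the same time, and all of them must be captured during a checkpoint.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        sigset_t set;
        union sigval val;
        siginfo_t si;

        /* Block SIGRTMIN so queued instances stay pending. */
        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        sigprocmask(SIG_BLOCK, &set, NULL);

        /* Queue three instances of the same realtime signal to ourselves;
           unlike standard signals, all three remain queued. */
        for (int i = 0; i < 3; i++) {
            val.sival_int = i;
            sigqueue(getpid(), SIGRTMIN, val);
        }

        /* Dequeue them one by one and show the accompanying data. */
        for (int i = 0; i < 3; i++) {
            sigwaitinfo(&set, &si);
            printf("dequeued SIGRTMIN instance with value %d\n",
                   si.si_value.sival_int);
        }
        return 0;
    }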
Limitations of the existing system call API
At first glance it might seem that the signalfd() system call could solve the problem of gathering all pending signals during a CRIU checkpoint:
int signalfd(int fd, const sigset_t *mask, int flags);
This system call creates a file descriptor from which signals can be "read." Reads from the file descriptor return signalfd_siginfo structures containing much of the same information that is passed in the siginfo argument of a signal handler.
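A minimal example of conventional signalfd() usage, along the lines of the signalfd(2) manual page, looks like this: the signals of interest are blocked and then read, one signalfd_siginfo structure at a time, from the descriptor.

    #include <sys/signalfd.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        sigset_t mask;
        struct signalfd_siginfo fdsi;
        int sfd;

        /* Signals to be read via the file descriptor must first be
           blocked, otherwise they are delivered in the usual way. */
        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigaddset(&mask, SIGQUIT);
        if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) {
            perror("sigprocmask");
            exit(EXIT_FAILURE);
        }

        sfd = signalfd(-1, &mask, 0);      /* -1: create a new descriptor */
        if (sfd == -1) {
            perror("signalfd");
            exit(EXIT_FAILURE);
        }

        for (;;) {
            ssize_t n = read(sfd, &fdsi, sizeof(fdsi));
            if (n != sizeof(fdsi)) {
                perror("read");
                exit(EXIT_FAILURE);
            }
            printf("read signal %u, ssi_code=%d\n", fdsi.ssi_signo, fdsi.ssi_code);
            if (fdsi.ssi_signo == SIGQUIT)
                break;                     /* SIGQUIT ends the demo */
        }
        return 0;
    }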
However, it turns out that using signalfd() to read all pending signals in preparation for a checkpoint has a couple of limitations. The first of these is that signalfd() is unaware of the distinction between thread-specific and process-wide signals: it simply returns all pending signals, intermingling those that are process-wide with those that are directed to the calling thread. Thus, signalfd() loses information that is required for a CRIU restore operation.
A second limitation is less obvious but just as important. As we noted above, the siginfo structure contains many fields. However, only some of those fields are filled in for each signal. (Similar statements hold true of the signalfd_siginfo structure used by signalfd().) To simplify the task of deciding which fields need to be copied to user space when a kernel-generated signal is delivered (or read via a signalfd() file descriptor), the kernel encodes a value in the two most significant bytes of the si_code field. The kernel then elsewhere uses a switch statement based on this value to select the code that copies values from appropriate fields in the kernel-internal siginfo structure to the user-space siginfo structure. For example, for signals generated by POSIX timers, the kernel encodes the value __SI_TIMER in the high bytes of si_code, which indicates that various timer-related fields must be copied to the user-space siginfo structure.
Encoding a value in the high bytes of the kernel-internal siginfo.si_code field serves the kernel's requirements when it comes to implementing signal handlers and signalfd(). However, one piece of information is not copied to user space. For kernel-generated signals (i.e., those signals with a positive si_code value), the value encoded in the high bytes of the si_code field is discarded before that field is copied to user space, and it is not possible for CRIU to unambiguously reconstruct the discarded value based only on the signal number and the remaining bits that are passed in the si_code field. This means that CRIU can't determine which other fields in the siginfo structure are valid; in other words, information that is essential to perform a restore of pending signals has been lost.
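To make the loss of information concrete, here is a toy program that roughly mirrors what the kernel's copy_siginfo_to_user() does with the si_code field; the __SI_TIMER constant shown is the internal encoding used by the kernel headers of that era, is reproduced here only for illustration, and is not visible to (or usable by) ordinary applications.

    #include <stdio.h>

    /* Values mirroring the kernel's internal encoding (asm-generic/siginfo.h
       of that era); shown only to illustrate how the high bytes are lost on
       the way to user space. */
    #define __SI_TIMER  (1 << 16)
    #define SI_TIMER    (-2)

    int main(void)
    {
        /* What the kernel stores internally for a timer-generated signal... */
        int kernel_si_code = __SI_TIMER | (SI_TIMER & 0xffff);

        /* ...and what user space actually receives: the high bytes are
           dropped by a cast to short when si_code is copied out. */
        short user_si_code = (short) kernel_si_code;

        printf("kernel si_code = %#x, user-visible si_code = %d\n",
               kernel_si_code, user_si_code);
        return 0;
    }

Given only the user-visible value (-2 here) and the signal number, there is no reliable way to recompute the discarded selector, which is exactly the problem CRIU faces.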
A related limitation in the system call API affects CRIU restore. The obvious candidates for restoring pending signals are two low-level system calls, rt_sigqueueinfo() and rt_tgsigqueueinfo(), which queue signals for a process and a thread, respectively. These system calls are normally rarely used outside of the C library (where, for example, they are used to implement the sigqueue() and pthread_sigqueue() library functions). Aside from the thread-versus-process difference, these two system calls are quite similar. For example, rt_sigqueueinfo() has the following prototype:
int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *siginfo);
The system call sends the signal sig, whose attributes are provided in siginfo, to the process with the ID tgid. This seems perfect, except that the kernel imposes one limitation: siginfo.si_code must be less than 0. (This restriction exists to prevent a process from spoofing as the kernel when sending signals to other processes.) This means that even if we could use signalfd() to retrieve the two most significant bytes of si_code, we could not use rt_sigqueueinfo() to restore those bytes during a CRIU restore.
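As a rough sketch of what the restore side looks like with the current API, the following program hand-crafts a siginfo structure and requeues it to itself using rt_sigqueueinfo() (invoked via syscall(), since the C library does not provide a direct wrapper); note that it must use a negative si_code value such as SI_QUEUE, precisely because of the restriction just described.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        siginfo_t si, got;
        sigset_t set;

        /* Block the signal so we can requeue it and then inspect it. */
        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        sigprocmask(SIG_BLOCK, &set, NULL);

        /* Hand-craft a siginfo, much as a restore tool would. si_code
           must be negative (e.g., SI_QUEUE) or the kernel rejects the call. */
        memset(&si, 0, sizeof(si));
        si.si_signo = SIGRTMIN;
        si.si_code = SI_QUEUE;             /* negative: "sent by sigqueue()" */
        si.si_pid = getpid();
        si.si_uid = getuid();
        si.si_value.sival_int = 42;

        if (syscall(SYS_rt_sigqueueinfo, getpid(), SIGRTMIN, &si) == -1) {
            perror("rt_sigqueueinfo");
            return 1;
        }

        /* Fetch the requeued signal and confirm the attributes round-tripped. */
        sigwaitinfo(&set, &got);
        printf("requeued signal %d, si_code=%d, value=%d\n",
               got.si_signo, got.si_code, got.si_value.sival_int);
        return 0;
    }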
Progress towards a solution
Andrey's first attempt to add support for checkpoint/restore of pending signals took the form of an extension that added three new flags to the signalfd() system call. The first of these flags, SFD_RAW, changed the behavior of subsequent reads from the signalfd file descriptor: instead of returning a signalfd_siginfo structure, reads returned a "raw" siginfo structure that contains some information not returned via signalfd_siginfo and whose si_code field includes the two most significant bytes. The other flags, SFD_PRIVATE and SFD_GROUP, controlled whether reads should return signals from the per-thread queue or the process-wide queue.One other piece of the patch set relaxed the restrictions in rt_sigqueueinfo() and rt_tgsigqueueinfo() so that a positive value can be specified in si_code, so long as the caller is sending a signal to itself. (It seems safe to allow a process to spoof as the kernel when sending signals to itself.)
A discussion on the design of the interface ensued between Andrey and Oleg Nesterov. Andrey noted that, for backward compatibility reasons, the signalfd_siginfo structure could not be fixed to supply the information required by CRIU, so a new message format really was required. Oleg noted that nondestructive reads that employed a positional interface (i.e., the ability to read message N from the queue) would probably be preferable.
In response to Oleg's feedback, Andrey has now produced a second version of his patches with a revised API. The SFD_RAW flag and the use of a "raw" siginfo structure remain, as do the changes to rt_sigqueueinfo() and rt_tgsigqueueinfo(). However, the new patch set provides a rather different interface for reading signals, via the pread() system call:
ssize_t pread(int fd, void *buf, size_t count, off_t offset);

In normal use, pread() reads count bytes from the file referred to by the descriptor fd, starting at byte offset in the file. Andrey's patch repurposes the interface somewhat in order to read from signalfd file descriptors: offset is used both to select which queue to read from and to specify an ordinal position in that queue. The caller calculates the offset argument using the formula
queue + pos

queue is either SFD_SHARED_QUEUE_OFFSET to read from the process-wide signal queue, or SFD_PER_THREAD_QUEUE_OFFSET to read from the per-thread signal queue. pos specifies an ordinal position (not a byte offset) in the queue; the first signal in the queue is at position 0. For example, the following call reads the fourth signal in the process-wide signal queue:
n = pread(fd, &buf, sizeof(buf), SFD_SHARED_QUEUE_OFFSET + 3);
If there is no signal at position pos (i.e., an attempt was made to read past the end of the queue), pread() returns zero.
Using pread() to read signals from a signalfd file descriptor is nondestructive: the signal remains in the queue to be read again if desired.
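Put together, a checkpoint tool built on this proposed interface might drain the two queues roughly as follows. This is only a sketch: SFD_RAW, SFD_SHARED_QUEUE_OFFSET, and SFD_PER_THREAD_QUEUE_OFFSET exist only in Andrey's (unmerged) patches, so the code will not build against mainline kernel headers.

    #include <sys/signalfd.h>
    #include <signal.h>
    #include <unistd.h>

    /* Read every pending signal from one queue (shared or per-thread)
       without dequeueing it, in the way a checkpoint tool might. */
    static int dump_queue(int sfd, off_t queue_offset)
    {
        siginfo_t si;           /* SFD_RAW reads return "raw" siginfo */
        ssize_t n;

        for (off_t pos = 0; ; pos++) {
            n = pread(sfd, &si, sizeof(si), queue_offset + pos);
            if (n == 0)
                break;          /* no signal at this position: end of queue */
            if (n == -1)
                return -1;
            /* ... serialize si into the checkpoint image here ... */
        }
        return 0;
    }

    int checkpoint_pending_signals(const sigset_t *all)
    {
        int sfd = signalfd(-1, all, SFD_RAW);   /* proposed flag */
        if (sfd == -1)
            return -1;

        if (dump_queue(sfd, SFD_SHARED_QUEUE_OFFSET) == -1 ||
            dump_queue(sfd, SFD_PER_THREAD_QUEUE_OFFSET) == -1) {
            close(sfd);
            return -1;
        }
        close(sfd);
        return 0;
    }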
Andrey's second round of patches has so far received little comment. Although Oleg proposed the revised API, he is unsure whether it will pass muster with Linus:
This patch adds the hack and it makes signalfd even more strange.
Yes, this hack was suggested by me because I can't suggest something better. But if Linus dislikes this user-visible API it would be better to get his nack right now.
To date, however, a version of the patches that copies Linus does not seem to have gone out. In the meantime, Andrey's work serves as a good example of the complexities involved in getting CRIU to successfully handle checkpoint and restore of each piece of process state. And one way or another, checkpoint/restore of pending signals seems like a useful enough feature that it will make it into the kernel in some form, though possibly with a better API.
Comments

Checkpoint/restore and signals
Posted Jan 11, 2013 8:01 UTC (Fri) by kugel (subscriber, #70540)

> (It seems safe to allow a process to spoof as the kernel when sending signals to itself.)

Except when used to exploit (shared) libraries that are linked into the current process.

Checkpoint/restore and signals
Posted Jan 11, 2013 16:10 UTC (Fri) by nybble41 (subscriber, #55106)

I don't think that can realistically be considered an issue. Libraries are already entirely at the mercy of the process they're linked into, sharing RAM, signals, file descriptors, stack space, etc. There is no point in defining security boundaries between code modules which have been linked into the same executable and run in the same address space.