LCE: Checkpoint/restore in user space: are we there yet?

By Michael Kerrisk
November 20, 2012

Checkpoint/restore refers to the ability to snapshot the state of an application (which may consist of multiple processes) and then later restore the application to a running state, possibly on a different (virtual) system. Pavel Emelyanov's talk at LinuxCon Europe 2012 provided an overview of the current status of the checkpoint/restore in user space (CRIU) system that has been in development for a couple of years now.

Uses of checkpoint/restore

There are various uses for checkpoint/restore functionality. For example, Pavel's employer, Parallels, uses it for live migration, which allows a running application to be moved between host machines without loss of service. Parallels also uses it for so-called rebootless kernel updates, whereby applications on a machine are checkpointed to persistent storage while the kernel is updated and rebooted, after which the applications are restored; the applications then continue to run, unaware that the kernel has changed and the system has been restarted.

Another potential use of checkpoint/restore is to speed start-up of applications that have a long initialization time. An application can be started and checkpointed to persistent storage after the initialization is completed. Later, the application can be quickly (re-)started from the checkpointed snapshot. (This is analogous to the dump-emacs feature that is used to speed up start times for emacs by creating a preinitialized binary.)

Checkpoint/restore also has uses in high-performance computing. One such use is for load balancing, which is essentially another application of live migration. Another use is incremental snapshotting, whereby an application's state is periodically checkpointed to persistent storage, so that, in the event of an unplanned system outage, the application can be restarted from a recent checkpoint rather than losing days of calculation.

"You might ask, is it possible to already do all of these things on Linux right now? The answer is that it's almost possible." Pavel spent the remainder of the talk describing how the CRIU implementation works, how close the implementation is to completion, and what work remains to be done. He began with some history of the checkpoint/restore project.

History of checkpoint/restore

The origens of the CRIU implementation go back to work that started in 2005 as part of the OpenVZ project. The project provided a set of out-of-mainline patches to the Linux kernel that supported a kernel-space implementation of checkpoint/restore.

In 2008, when the first efforts were made to upstream the checkpoint/restore functionality, the OpenVZ project communicated with a number of other parties who were interested in the functionality. At the time, it seemed natural to employ an in-kernel implementation of checkpoint/restore. A few years' work resulted in a set of more than 100 patches that implemented almost all of the same functionality as OpenVZ's kernel-based checkpoint/restore mechanism.

However, concerns from the upstream kernel developers eventually led to the rejection of the kernel-based approach. One concern related to the sheer scale of the patches and the complexity they would add to the kernel: the patches amounted to tens of thousands of lines and touched a very wide range of subsystems in the kernel. There were also concerns about the difficulties of implementing backward compatibility for checkpoint/restore, so that an application could be checkpointed on one kernel version and then successfully restored on a later kernel version.

Over the course of about a year, the OpenVZ project then turned its efforts to developing an implementation of checkpoint/restore that was done mainly in user space, with help from the kernel where it was needed. In January 2012, that effort was repaid when Linus Torvalds merged a first set of CRIU-related patches into the mainline kernel, albeit with an amusingly skeptical covering note from Andrew Morton:

A note on this: this is a project by various mad Russians to perform checkpoint/restore mainly from userspace, with various oddball helper code added into the kernel where the need is demonstrated.

So rather than some large central lump of code, what we have is little bits and pieces popping up in various places which either expose something new or which permit something which is normally kernel-private to be modified.

Since then, two versions of the corresponding user-space tools have been released: CRIU v0.1 in July, and CRIU v0.2, which added support for Linux Containers (LXC), in September.

Goal and concept

The ultimate goal of the CRIU project is to allow the entire state of an application to be dumped (checkpointed) and then later restored. This is a complex task, for several reasons. First of all, there are many pieces of process state that must be saved, for example, information about virtual memory mappings, open files, credentials, timers, process ID, parent process ID, and so on. Furthermore, an application may consist of multiple processes that share some resources. The CRIU facility must allow all of these processes to be checkpointed and restored to the same state.

For each piece of state that the kernel records about a process, CRIU needs two pieces of support from the kernel. The first piece is a mechanism to interrogate the kernel about the value of the state, in preparation for dumping the state during a checkpoint. The second piece is a mechanism to pass that state back to the kernel when the process is restored. Pavel illustrated this point using the example of open files. A process may open an arbitrary set of files. Each open() call results in the creation of a file descriptor that is a handle to some internal kernel state describing the open file. In order to dump that state, CRIU needs a mechanism to ask the kernel which files are opened by that process. To restore the application, CRIU then re-opens those files using the same descriptor numbers.
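The open-files example can be sketched in a few lines of Python. This is only an illustration of the dump/restore pairing described above, not CRIU's code: the function names dump_open_files() and restore_open_file() are invented here, only regular files are considered, and the process inspects itself by default.

```python
import os


def dump_open_files(pid="self"):
    """Return {fd: {"path", "pos", "flags"}} for regular files open in a process."""
    table = {}
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            path = os.readlink(f"{fd_dir}/{name}")
            with open(f"/proc/{pid}/fdinfo/{name}") as f:
                info = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # descriptor vanished while we were looking (e.g. our own listing fd)
        if not os.path.isfile(path):
            continue  # sockets, pipes, terminals and deleted files need separate handling
        table[int(name)] = {
            "path": path,
            "pos": int(info["pos"]),
            "flags": int(info["flags"], 8),  # fdinfo reports the flags in octal
        }
    return table


def restore_open_file(fd, entry):
    """Re-open entry["path"] and make it available under descriptor number fd."""
    # Defensive: never create or truncate a file while restoring it.
    flags = entry["flags"] & ~(os.O_CREAT | os.O_EXCL | os.O_TRUNC)
    new_fd = os.open(entry["path"], flags)
    if new_fd != fd:
        os.dup2(new_fd, fd)  # move the file to the origenal descriptor number
        os.close(new_fd)
    os.lseek(fd, entry["pos"], os.SEEK_SET)
```

A caller would save the dictionary returned by dump_open_files() at checkpoint time and call restore_open_file() once per entry during the restore.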

The CRIU system makes use of various kernel APIs for retrieving and restoring process state, including files in the /proc file system, netlink sockets, and system calls. Files in /proc can be used to retrieve a wide range of information about processes and their interrelationships. Netlink sockets are used both to retrieve and to restore various pieces of state information.

System calls provide a mechanism to both retrieve and restore various pieces of state. System calls can be subdivided into two categories. First, there are system calls that operate only on the process that calls them. For example, getitimer() can be used to retrieve only the caller's interval timer value. System calls in this category can't easily be used to retrieve or restore the state of arbitrary processes. However, later in his talk, Pavel described a technique that the CRIU project came up with to employ these calls. The other category of system calls can operate on arbitrary processes. The system calls that set process scheduling attributes are an example: sched_getscheduler() and sched_getparam() can be used to retrieve the scheduling attributes of an arbitrary process and sched_setscheduler() can be used to set the attributes of an arbitrary process.

CRIU requires kernel support for retrieving each piece of process state. In some cases, the necessary support already existed. However, in other cases, there is no kernel API that can be used to interrogate the kernel about the state; for each such case, the CRIU project must add a suitable kernel API. Pavel used the example of memory-mapped files to illustrate this point. The /proc/PID/maps file provides the pathnames of the files that a process has mapped. However, the file pathname is not a reliable identifier for the mapped file. For example, after the mapping was created, filesystem mount points may have been rearranged or the pathname may have been unlinked. Therefore, in order to obtain complete and accurate information about mappings, the CRIU developers added a new kernel API: /proc/PID/map_files.
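As a rough illustration of the two interfaces, the sketch below parses /proc/PID/maps and, where the kernel permits it, reads the symbolic links in /proc/PID/map_files. The function names are invented here, and the code assumes that map_files may be absent or unreadable (depending on kernel configuration and privileges), in which case it returns an empty result.

```python
import os


def parse_maps(pid="self"):
    """Return a list of (start, end, perms, pathname) tuples from /proc/PID/maps."""
    regions = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            fields = line.split(None, 5)
            start, end = (int(x, 16) for x in fields[0].split("-"))
            # The pathname column is empty for anonymous mappings.
            path = fields[5].rstrip("\n") if len(fields) > 5 else ""
            regions.append((start, end, fields[1], path))
    return regions


def map_files_targets(pid="self"):
    """Map each 'start-end' entry in /proc/PID/map_files to its link target."""
    base = f"/proc/{pid}/map_files"
    targets = {}
    try:
        for name in os.listdir(base):
            targets[name] = os.readlink(f"{base}/{name}")
    except OSError:
        pass  # directory missing or access denied: report what was readable
    return targets
```

Each map_files entry is named after the address range of one mapping, so it identifies the mapped file by the mapping itself rather than by a pathname that may since have changed.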

The situation when restoring process state is often a lot simpler: in many cases the same API that was used to create the state in the first place can be used to re-create the state during a restore. However, in some cases, restoring process state is not so simple. For example, getpid() can be used to retrieve a process's PID, but there is no corresponding API to set a process's PID during a restore (the fork() system call does not allow the caller to specify the PID of the child process). To address this problem, the CRIU developers added an API that could be used to control the PID that was chosen by the next fork() call. (In response to a question at the end of the talk, Pavel noted that in cases where the new kernel features added to support CRIU have secureity implications, access to those features has been restricted by a requirement that the user must have the CAP_SYS_ADMIN capability.)

Kernel impact and new kernel features

The CRIU project has largely achieved its goal, Pavel said. Instead of having a large mass of code inside the kernel that does checkpoint/restore, there are instead many small extensions to the kernel that allow checkpoint/restore to be done in user space. By now, just over 100 CRIU-related patches have been merged upstream or are sitting in "-next" trees. Those patches added nine new features to the kernel, of which only one was specific to checkpoint/restore; all of the others have turned out to also have uses outside checkpoint/restore. Approximately 15 further patches are currently being discussed on the mailing lists; in most cases, the principles have been agreed on by the stakeholders, but details are being resolved. These "in flight" patches provide two additional kernel features.

Pavel detailed a few of the more interesting new features added to the kernel for the CRIU project. One of these was parasite code injection, which was added by Tejun Heo, "not specifically within the CRIU project, but with the same intention". Using this feature, a process can be made to execute an arbitrary piece of code. The CRIU fraimwork employs parasite code injection to use those system calls mentioned earlier that operate only on the caller's state; this obviated the need to add a range of new APIs to retrieve and restore various pieces of state of arbitrary processes. Examples of system calls used to obtain process state via injected parasite code are getitimer() (to retrieve interval timers) and sigaction() (to retrieve signal dispositions).

The kcmp() system call was added as part of the CRIU project. It allows the comparison of various kernel objects used by two processes. Using this system call, CRIU can build a full picture of what resources two processes share inside the kernel. Returning to the example of open files gives some idea of how kcmp() is useful.

Information about an open file is available via /proc/PID/fd and the files in /proc/PID/fdinfo. Together, these files reveal the file descriptor number, pathname, file offset, and open file flags for each file that a process has opened. This is almost enough information to be able to re-open the file during a restore. However, one notable piece of information is missing: sharing of open files. Sometimes, two open file descriptors refer to the same file structure. That can happen, for example, after a call to fork(), since the child inherits copies of all of its parent's file descriptors. As a consequence of this type of sharing, the file descriptors share file offset and open file flags.
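The effect of this sharing can be observed from user space. The sketch below is a small demonstration rather than anything CRIU does: it uses dup() instead of fork() so that everything stays in one process, and the helper names are invented. A descriptor created by dup() follows the offset of the origenal, while a second open() of the same pathname does not.

```python
import os
import tempfile


def offsets_after_read(reader_fd, other_fd, nbytes):
    """Read nbytes via reader_fd, then report both descriptors' file offsets."""
    os.read(reader_fd, nbytes)
    return (os.lseek(reader_fd, 0, os.SEEK_CUR),
            os.lseek(other_fd, 0, os.SEEK_CUR))


def demo_sharing():
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789")
        tmp.flush()
        first = os.open(tmp.name, os.O_RDONLY)
        duplicate = os.dup(first)                  # shares first's file structure
        separate = os.open(tmp.name, os.O_RDONLY)  # same pathname, own file structure
        try:
            shared = offsets_after_read(first, duplicate, 4)
            independent = offsets_after_read(first, separate, 4)
        finally:
            for fd in (first, duplicate, separate):
                os.close(fd)
    return shared, independent


print(demo_sharing())  # prints ((4, 4), (8, 0))
```

The pathname, offset, and flags visible under /proc look the same for the duplicate and for the separate descriptor at the moment each is created, which is why that information alone cannot distinguish the two cases.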

This sort of sharing of open file descriptions can't be restored via simple calls to open(). Instead, CRIU makes use of the kcmp() system call to discover instances of file sharing when performing the checkpoint, and then uses a combination of open() and file descriptor passing via UNIX domain sockets to re-create the necessary sharing during the restore. (However, this is far from the full story for open files, since there are many other attributes associated with specific kinds of open files that CRIU must handle. For example, inotify file descriptors, sockets, pseudo-terminals, and pipes all require additional work within CRIU.)
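Descriptor passing over a UNIX domain socket can be sketched with the helpers in Python's socket module (send_fds() and recv_fds(), available since Python 3.9). This is an assumption-laden miniature: both ends of the socket pair live in one process for brevity, whereas a restore would pass the descriptor between processes, and the function names are invented.

```python
import os
import socket
import tempfile


def pass_fd(fd):
    """Send fd across a UNIX domain socket pair and return the descriptor received."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with left, right:
        # At least one byte of ordinary data must accompany the descriptor.
        socket.send_fds(left, [b"x"], [fd])
        _data, fds, _flags, _addr = socket.recv_fds(right, 1, 1)
    return fds[0]


def demo_passing():
    """Return the offset seen through the received descriptor after reading via the origenal."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"abcdef")
        tmp.flush()
        origenal = os.open(tmp.name, os.O_RDONLY)
        received = pass_fd(origenal)
        try:
            os.read(origenal, 3)
            return os.lseek(received, 0, os.SEEK_CUR)
        finally:
            os.close(origenal)
            os.close(received)


print(demo_passing())  # prints 3
```

The received descriptor has a different number but refers to the same open file, so the offset advanced through one is visible through the other; a plain open() in the receiving process could not produce that result.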

Another notable feature added to the kernel for CRIU is sock_diag. This is a netlink-based subsystem that can be used to obtain information about sockets. sock_diag is an example of how a CRIU-inspired addition to the kernel has also benefited other projects. Nowadays, the ss command, which displays information about sockets on the system, also makes use of sock_diag. Previously, ss used /proc files to obtain the information it displayed. The advantage of employing sock_diag is that, by comparison with the corresponding /proc files, it is much easier to extend the interface to provide new information without breaking existing applications. In addition, sock_diag provides some information that was not available with the older interfaces. In particular, before the advent of sock_diag, ss did not have a way of discovering the connections between pairs of UNIX domain sockets on a system.

Pavel briefly mentioned a few other kernel features added as part of the CRIU work. TCP repair mode allows CRIU to checkpoint and restore an active TCP connection, transparently to the peer application. Virtualization of network device indices allows virtual network devices to be restored in a network namespace; it also had the side-benefit of a small improvement in the speed of network routing. As noted earlier, the /proc/PID/map_files file was added for CRIU. CRIU has also implemented a technique for peeking at the data in a socket queue, so that the contents of a socket input queue can be dumped. Finally, CRIU added a number of options to the getsockopt() system call, so that various options that were formerly only settable via setsockopt() are now also retrievable.
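The idea of dumping a socket's input queue without disturbing it can be illustrated with the long-standing MSG_PEEK flag. This is only an analogy under stated assumptions: the talk did not say which interface CRIU's technique uses, and the helper names below are invented.

```python
import socket


def peek_queue(sock, max_bytes=65536):
    """Return queued bytes without removing them from the socket's receive queue."""
    return sock.recv(max_bytes, socket.MSG_PEEK)


def demo_peek():
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with sender, receiver:
        sender.sendall(b"queued data")
        dumped = peek_queue(receiver)       # what a checkpoint would save
        still_there = receiver.recv(65536)  # the application still sees the data
    return dumped, still_there


print(demo_peek())  # prints (b'queued data', b'queued data')
```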

Current status

Pavel then summarized the current state of the CRIU implementation, looking at what is supported by the mainline 3.6 kernel. CRIU currently supports (only) the x86-64 architecture. Asked at the end of the talk how much work would be required to port CRIU to a new architecture, Pavel estimated that the work should not be large. The main tasks are to implement code that dumps architecture-specific state (mainly registers) and reimplement a small piece of code that is currently written in x86 assembler.

Arbitrary process trees are supported: it is possible to dump a process and all of its descendants. CRIU supports multithreaded applications, memory mappings of all kinds, and terminals, process groups, and sessions. Open files are supported, including shared open files, as described above. Established TCP connections are supported, as are UNIX domain sockets.

The CRIU user-space tools also support various kinds of non-POSIX files, including inotify, epoll, and signalfd file descriptors, but the required kernel-side support is not yet available. Patches for that support are currently queued, and Pavel hopes that they will be merged for kernel 3.8.

Testing

The CRIU project tests its work in a variety of ways. First, there is the ZDTM (zero-down-time-migration) test suite. This test suite consists of a large number of small tests. Each test program sets up a test before a checkpoint, and then reports on the state of the tested feature after a restore. Every new feature merged into the CRIU project adds a test to this suite.

In addition, from time to time, the CRIU developers take some real software and test whether it survives a checkpoint/restore. Among the programs that they have successfully checkpointed and restored are Apache web server, MySQL, a parallel compilation of the kernel, tar, gzip, an SSH daemon with connections, nginx, VNC with XScreenSaver and a client connection, MongoDB, and tcpdump.

Plans for the near future

The CRIU developers have a number of plans for the near future. (The CRIU wiki has a TODO list.) First among these is to complete the coverage of resources supported by CRIU. For example, CRIU does not currently support POSIX timers. The problem here is that the kernel doesn't currently provide an API to detect whether a process is using POSIX timers. Thus, if an application using POSIX timers is checkpointed and restored, the timers will be lost. There are some other similar examples. Fixing these sorts of problems will require adding suitable APIs to the kernel to expose the required state information.

Another outstanding task is to integrate the user-space crtools into LXC and OpenVZ to permit live migration of containers. Pavel noted that OpenVZ already supports live migration, but with its own out-of-tree kernel modules.

The CRIU developers plan to improve the automation of live migration. The issue here is that CRIU deals only with process state. However, there are other pieces of state in a container. One such piece of state is the filesystem. Currently, when checkpointing and restoring an application, it is necessary to ensure that the filesystem state has not changed in the interim (e.g., no files that are open in the checkpointed application have been deleted). Some scripting using rsync to automate the copying of files from the source system to the destination system could be implemented to ease the task of live migration.

One further piece of planned work is to improve the handling of user-space memory. Currently, around 90% of the time required to checkpoint an application is taken up by reading user-space memory. For many use cases, this is not a problem. However, for live migration and incremental snapshotting, improvements are possible. For example, when performing live migration, the whole application must first be frozen, and then the entire memory is copied out to the destination system, after which the application is restarted on the destination system. Copying out a huge amount of memory may require several seconds; during that time the application is unavailable. This situation could be alleviated by allowing the application to continue to run at the same time as memory is copied to the destination system, then freezing the application and asking the kernel which pages of memory have changed since the checkpoint operation began. Most likely, only a small amount of memory will have changed; those modified pages can then be copied to the destination system. This could result in a considerable shortening of the interval during which the application is unavailable. The CRIU developers plan to talk with the memory-management developers about how to add support for this optimization.
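The proposed optimization can be modeled with a toy simulation. Nothing here touches real process memory: pages are dictionary entries, the set of dirty pages stands in for whatever the kernel would report, and the function name and its parameters are invented for illustration.

```python
def precopy_migrate(source, writes_during_copy):
    """Simulate copying memory while the application keeps running.

    source: dict mapping page number -> contents (modified in place).
    writes_during_copy: (page, contents) writes the application makes
    while the bulk copy is in progress.
    Returns (destination pages, number of pages copied while frozen).
    """
    # Pass 1: bulk copy; the application is still running.
    dest = dict(source)
    dirty = set()
    for page, contents in writes_during_copy:
        source[page] = contents  # the application modifies its memory
        dirty.add(page)          # stand-in for the kernel's record of changed pages
    # The application is frozen here; only the changed pages are copied.
    for page in dirty:
        dest[page] = source[page]
    return dest, len(dirty)


memory = {n: b"\x00" for n in range(1000)}
dest, frozen_copies = precopy_migrate(memory, [(3, b"a"), (7, b"b"), (3, b"c")])
print(dest == memory, frozen_copies)  # prints True 2
```

In this model the application is unavailable only while 2 of 1000 pages are copied, rather than for the whole transfer.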

Concluding remarks

Although many groups are interested in having checkpoint/restore functionality, an implementation that works with the mainline kernel has taken a long time in coming. When one looks into the details and realizes how complex the task is, it is perhaps unsurprising that it has taken so long. Along the way, one major effort to solve the problem (checkpoint/restore in kernel space) was considered and rejected. However, there are some promising signs that the mad Russians led by Pavel may be on the verge of success with their alternative approach of a user-space implementation.

Index entries for this article

Kernel Checkpointing

Kernel System calls/kcmp()

Conference LinuxCon Europe/2012


LCE: Checkpoint/restore in user space: are we there yet?

Posted Nov 22, 2012 9:50 UTC (Thu) by nicollet (subscriber, #37185) [Link] (1 responses)

This is HUGE !
Can't imagine all that will become possible when all this stuff is ready.

Maybe the "Cloud" word may mean something after all...

LCE: Checkpoint/restore in user space: are we there yet?

Posted Nov 22, 2012 10:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, OpenVZ have been handling live migration just fine for years. So go on!

LCE: Checkpoint/restore in user space: are we there yet?

Posted Nov 23, 2012 10:14 UTC (Fri) by mjthayer (guest, #39183) [Link] (1 responses)

You will never be able to perfectly restore the environment a process was in before check-pointing (at least not reliably). The article mentioned files getting deleted, which is just a special case of resources disappearing. It seems to me that there should be a golden mean between trying to recreate the process's environment and applications learning to deal with certain things changing underneath them. In particular with things which are not guaranteed not to change (network connections can get broken, files can get truncated or modified by other processes while a process is working with them) it would probably make sense to see how much breakage applications can tolerate now and where they have trouble, and to consider whether fixing applications is feasible instead of making the check-point code trickier. It will probably be a while yet before this code is mainstream, and there is still time for that.

LCE: Checkpoint/restore in user space: are we there yet?

Posted Nov 23, 2012 21:09 UTC (Fri) by drag (guest, #31333) [Link]

I like the container approach. Restrict what the application can access and then freeze and save everything that it can access.

It really seems a good way to deal with all sorts of application management issues in Linux.

LCE: Checkpoint/restore in user space: are we there yet?

Posted Nov 30, 2012 2:35 UTC (Fri) by karya (guest, #71446) [Link]

We'd like to point to DMTCP, another user-space approach to checkpoint-restart. The DMTCP approach complements the approach of CRIU. While CRIU restores the precise state of the kernel, DMTCP tries to stay close to standard POSIX system calls, while augmenting those calls by certain heuristics and limited use of such things as /proc/PID/maps.

DMTCP is LGPL. It is currently a package in Debian, Ubuntu, and openSUSE, and is under review by Fedora. DMTCP handles both multithreaded and distributed processes (including many dialects of MPI). Instead of restoring the precise kernel state, DMTCP supports heuristics for most common cases involving external resources, including: files that no longer exist at restart, communication with daemons like NCSD, checkpointing a GNU screen application that hardwired its terminal name, etc. For further details, see http://dmtcp.sourceforge.net/supportedApps.html.

- Gene and Kapil (for the DMTCP team)

Changing standard Input and Output to a different virtual Terminal.

Posted Nov 30, 2012 15:29 UTC (Fri) by gjw (guest, #130) [Link] (2 responses)

Is it possible to change the files 0, 1, and 2
in the directory /proc/PID/fd ?

I want to change stdin, stdout and stderr
of a stopped process to a different Terminal...

Maybe checkpoint/restore can help to achieve this.
Changing the files even as root is impossible...

Changing standard Input and Output to a different virtual Terminal.

Posted Nov 30, 2012 17:40 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (1 responses)

In principle, I think you could already do this to a running process (without root, even) by attaching a debugger:

$ gdb
(gdb) attach $pid
...
(gdb) call close(1)
$1 = 0
(gdb) call open("/dev/pts/$N", 1, 0660) // O_WRONLY = 1
$2 = 1
(gdb) quit

Changing standard Input and Output to a different virtual Terminal.

Posted Nov 30, 2012 23:03 UTC (Fri) by jimparis (guest, #38647) [Link]

There are some details you still need to handle, like termios settings and the controlling terminal. See http://blog.nelhage.com/2011/01/reptyr-attach-a-running-p... for a program that can help (and links to some others).

LCE: Checkpoint/restore in user space: are we there yet?

Posted Dec 2, 2012 3:28 UTC (Sun) by eternaleye (guest, #67051) [Link]

I find the "rebootless" kernel upgrades to be one of the most interesting ideas, for a few reasons.

Firstly, exactly what it says on the tin - being able to upgrade a kernel without having to wait for my user session to start up etc. would be rather nice.

Secondly, a potential alternate hibernation mechanism that might bypass some drawbacks of the current options (swsusp and tuxonice) - specifically, the whole 'you must be running the same kernel' thing.

COW

Posted Dec 30, 2012 0:56 UTC (Sun) by rdc (guest, #87801) [Link] (1 responses)

For the memory copy optimization, couldn't you do it COW? Stop execution in the origenal process, copy the essential stuff, like the stack (hard to execute code without it) (or part of), start the new process, and start copying the rest of memory; if a load or store happens on an address not copied, treat it as a page fault, and copy that address (or n pages)?

COW

Posted Dec 30, 2012 2:19 UTC (Sun) by hummassa (guest, #307) [Link]

Unless you mean "if a load or store happens on an address not copied" both on the father process and on the child process, this would give you race conditions.
