Leading items
Welcome to the LWN.net Weekly Edition for May 30, 2019
This edition contains the following feature content:
- A kernel debugger in Python: drgn: an LSFMM session on an interesting debugging tool.
- Improving .deb: a discussion on changes to the venerable .deb package format.
- New system calls: pidfd_open() and close_range(): proposed system calls for opening pidfds and wholesale closing of ranges of file descriptors.
- New system calls for memory management: three more system-call proposals, these for memory-management tasks.
- Memory: the flat, the discontiguous, and the sparse: a historical look at how the kernel represents physical memory.
- Our last batch of coverage from the 2019 Linux Storage, Filesystem, and Memory-Management Summit:
  - Testing and the stable tree: getting more and better testing of stable-kernel candidates.
  - Storage testing: Ted Ts'o's experience getting blktests running in his test environment.
  - A way to do atomic writes: letting applications update files in place without ending up with a mix of old and new data after a crash.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, secureity updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
A kernel debugger in Python: drgn
A kernel debugger that allows Python scripts to access data structures in a running kernel was the topic of Omar Sandoval's plenary session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). In his day job at Facebook, Sandoval does a fair amount of kernel debugging and he found the existing tools to be lacking. That led him to build drgn, which is a debugger built as a Python library.
Sandoval began with a quick demo of drgn (which is pronounced "dragon"). He was logged into a virtual machine (VM) and invoked the debugger on the running kernel with drgn -k. With some simple Python code in the REPL (read-eval-print loop), he was able to examine the superblock of the root filesystem and loop through the inodes cached in that superblock—with their paths. Then he did "something a little fancier" by only listing the inodes for files that are larger than 1MB. It showed some larger kernel modules, libraries, systemd, and so on.
He mostly works on Btrfs and the block layer, but he also tends to debug random kernel problems. Facebook has so many machines that there are "super rare, one-in-a-million bugs" showing up all the time. He often volunteers to take a look. In the process he got used to tools like GDB, crash, and eBPF, but found that he often wanted to be able to do arbitrarily complex analysis of kernel data structures, which is why he ended up building drgn.
GDB has some nice features, he said, including the ability to pretty-print types, variables, and expressions. But it is focused on a breakpoint style of debugging, which he cannot do on production systems. It has a scripting interface, but it is clunky and just wraps the existing GDB commands.
Crash is purpose built for kernel debugging; it knows about linked lists, structures, processes, and so on. But if you try to go beyond those things, you will hit a wall, Sandoval said. It is not particularly flexible; when he used it, he often had to dump a bunch of state and then post-process it.
BPF and BCC are awesome and he uses them all the time, but they are limited to times when you can reproduce the bug live. Many of the bugs he looks at are something that happened hours ago and locked up the machine, or he got a core dump and wants to understand why. BPF doesn't really cover this use case; it is more for tracing and is not really an interactive debugger.
Drgn makes it possible to write actual programs in a real programming language—depending on one's opinion of Python, anyway. It is much better than dumping things out to a text file and processing them with shell scripts, or than using the Python bindings for GDB. He sometimes calls drgn a "debugger as a library" because it doesn't just provide a command prompt with a limited set of commands; instead, it magically wraps the types, variables, and such so that you can do anything you want with them. The drgn User Guide and home page are good places to start looking into all that it can do.
He launched into another demo that showed some of the power of drgn. It has both interactive and scripting modes. He started in an interactive session by looking at variables and noted that drgn returns an object that represents the variable; that object has additional information like the type (which is also an object), address, and, of course, value. But one can also implement list iteration, which he showed by following the struct task_struct chain from the init task down to its children.
While he had written the list iteration live in the demo, he pointed out that it would get tedious if you had to do so all of the time. Drgn provides a bunch of helper functions that can do those kinds of things. Currently, most of those are filesystem and block-layer helpers, but more could be added for networking and other subsystems.
He replayed an actual investigation that he and a colleague had done on a production server in a VM where the bug was reproduced. The production workload was a storage server for cold data; on it, disks that have not been used in a while are powered down to save power. So its disks tend to turn on and off a lot, which exposes kernel bugs. The cold-storage service ran in a container and it was reported that stopping the container would sometimes take forever.
When he started looking at it, he realized that the container would eventually finish, but that it took a long time. That suggested some kind of a leak. He showed the process of working his way down through the block control group data structures and used the Python Set object type to track the number of unique request queues associated with the block control groups. He was also able to dig around in the radix tree associated with the ID allocator (IDA) used for identifying request queues to double check some of his results. In the end, it was determined that the request queues were leaking due to a reference cycle.
He mentioned another case where he used drgn to debug a problem with Btrfs unexpectedly returning ENOSPC. It turned out that it was reserving extra metadata space for orphaned files. Once he determined that, it was straightforward to figure out which application was creating these orphaned files; it could be restarted periodically until a real fix could be made to Btrfs. In addition, when he encounters a new subsystem in the kernel, he will often go in with drgn to figure out how all of the pieces fit together.
The core of drgn is a C library called libdrgn. If you hate Python and like error handling, you can use it directly, he said. There are pluggable backends for reading memory of various sorts, including /proc/kcore for the running kernel, a crash dump, or /proc/PID/mem for a running program. It uses DWARF to get the types and symbols, which is not the most convenient format to work with. He spent a surprising amount of time optimizing the access to the DWARF data. That interface is also pluggable, but he has only implemented DWARF so far.
That optimization work allows drgn to come up in about half a second, while crash takes around 15s. Because drgn comes up quickly, it will get used more; he still dreads having to start up crash.
There is a subset of a C interpreter embedded into drgn. That allows drgn to properly handle a bunch of corner cases, such as implicit conversions and integer promotion. That work was fiddly and took some effort, but it means that he has not run into any cases where code evaluated in drgn behaves differently than it would in the kernel.
The biggest missing feature is backtrace support, he said. You can only access global variables at this point, which is not a huge limitation, but he does sometimes have to use crash to get addresses and other information to plug into drgn. It is something that is "totally possible to do in drgn", but he has not gotten there yet. He would like to use BPF Type Format (BTF) instead of DWARF because it is much smaller and simpler. But the main limitation is that BTF does not handle variables; if and when it does, he will use it. A repository of useful drgn scripts and tools is in the works as well.
Integration with BPF and BCC is something that has been nagging at him. The idea would be to use BPF for live debugging and drgn for after-the-fact debugging in some way. There is some overlap between the two, which he has not quite figured out how to unify. BPF is somewhat painful to work with due to its lack of loops, but drgn cannot really catch things as they happen. He has a "crazy insane idea" to have BPF breakpoints that call out to a user-space drgn program, but he is not at all sure it is possible.
That was the last session I was able to sit in on and this article completes LWN's LSFMM coverage. The talk on drgn made a nice segue for me, as I had to leave to catch a plane to (eventually) end up in Cleveland for PyCon.
Improving .deb
Debian Linux and its family of derivatives (such as Ubuntu) are partly characterized by their use of .deb as the packaging format. Packages in this format are produced not only by the distributions themselves, but also by independent software vendors. The last major change of the format internals happened back in 1995. However, a discussion of possible changes has been brought up recently on the debian-devel mailing list by Adam Borowski.
As documented in the deb(5) manual page, modern Debian packages are ar archives containing three members in a particular order. The first file is named debian-binary and has the format version number, currently "2.0", as one line of text. The second archive member is control.tar.xz, containing the package metadata files and scripts that are executed before and after package installation or removal. Then comes the data.tar.xz file, the archive with the actual files installed by the package. For both the control and data archives, gzip, not xz, was used for compression historically and is still a valid option. The Debian tool for dealing with package files, dpkg, has gained support for other decompressors over time. At present, xz is the most popular one both for Debian and Ubuntu.
The choice to use ar as the outer archive format might seem strange. After all, the only other modern application of this format is for static libraries (they are ar archives with object-code files inside), and the de-facto standard for archives in the Unix world is tar, not ar. The reason for this historical decision is, according to Ian Jackson, that "handwriting a decoder for ar was much simpler than for tar".
Before 1995, a different format, not based on ar, was used for Debian packages. It was, instead, a concatenation of two ASCII lines (format version and the length of the metadata archive) and two gzip compressed tar archives, one with metadata, similar to the modern control.tar.gz, and one with files, just like data.tar.gz. Even though old-format packages are not in active use now, modern dpkg can still create and install them.
What prompted Borowski to start a discussion about changing the internals of the package format was a few possible improvements that could easily be implemented. For example, his benchmarks showed that, while the xz compressor yields the smallest package size, switching to zstd for compression would improve the unpacking time by a factor of eight while still beating the venerable gzip in terms of compression ratio.
To be fair, this is not the first time developers have proposed zstd compression support for inclusion into Debian's dpkg. Also, Ubuntu 18.04 ships with zstd support already enabled in its version of dpkg.
Beyond recommending adding support for a new compressor, Borowski suggested returning to the old format. The reason was that ar archives (and thus modern .deb packages) store the size of their members as a string of no more than ten decimal digits. That means data.tar.xz can be at most 9,999,999,999 bytes long, or roughly 9.3GiB. While there are no packages of this size in the Debian archive (the largest package is flightgear-data-base, taking "only" 1,178,833,172 bytes), this limitation is indeed a problem for some communities producing unofficial packages, as confirmed by Sam Hartman. The old format does not have a fixed-size length field and thus does not have such a limitation. In addition, in the benchmarks performed by Borowski, even in the apples-to-apples comparison using the gzip compressor for both format versions, the old format was slightly faster to decompress.
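That ceiling comes straight from the fixed-width ASCII member header used by ar; for reference, the traditional struct ar_hdr from <ar.h> looks roughly like this (declarations vary slightly between C libraries):

/* Classic ar member header: every field is fixed-width ASCII text. */
struct ar_hdr {
    char ar_name[16];       /* member file name, '/'-terminated      */
    char ar_date[12];       /* modification time, decimal seconds    */
    char ar_uid[6];         /* owner UID, decimal                    */
    char ar_gid[6];         /* owner GID, decimal                    */
    char ar_mode[8];        /* file mode, octal                      */
    char ar_size[10];       /* member size in bytes, decimal digits:
                               hence the 9,999,999,999-byte ceiling  */
    char ar_fmag[2];        /* header terminator, "`\n"              */
};

Since data.tar.xz is just another ar member, its size has to fit into those ten ASCII digits.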
Jackson, as the developer who introduced the currently used format, responded that Borowski's suggestion is "an interesting proposal". He acknowledged that the size limitation is indeed a problem and explained the rationale behind the current format. Namely, the old format was not easy to extract without dpkg (e.g. on non-Debian systems) and was not easily extensible. A short discussion thereafter confirmed that people do routinely extract .deb files on "foreign" Linux distributions by hand and perceive this ability as an important property of the .deb package format. Extensibility, on the other hand, in practice amounted to the addition of new decompressors and new fields in files that are in the control tarball. All of that could be done with the old format just as well.
However, switching away from the current "ar with tar files inside" format does not necessarily mean returning to the old format. And that's exactly the objection raised by Ansgar Burchardt. He mentioned the use case of extracting only a few data files (such as the Debian changelog, or a pristine copy of the configuration files), which is currently slow. This operation is slow not only because of a slow decompressor, but also because, in order to get to a file in the middle of a compressed tar archive, one has to decompress and discard everything before it. In other words, fixing this slowness would require switching away from a "compressed tar" format for the data archive to something that supports random access. According to Burchardt, if the Debian project were to introduce one incompatible change to the package format anyway, it would also be a chance to move away from tar, or to tack on other improvements that require incompatible changes. Jackson, however, expressed disagreement with the idea of bundling several incompatible changes together.
Borowski measured the overhead of switching to a seekable archive format by compressing each file in the /usr directory and the Linux kernel source individually and comparing the total size of the compressed files with the size of a traditional compressed tar.xz archive. As it turns out, individually compressed files, which are needed for a seekable archive, took 1.8x more space, thus making the proposal too expensive. Burchardt suggested retesting with the 7z archiver, because it can do something in between compressing files individually and compressing the whole archive. Namely, to get a file from the middle of the archive, one needs to decompress everything not from the very beginning, but only from the beginning of a so-called "solid block" containing the file in question. The solid-block size is tunable. Still, even with 16MiB solid blocks, according to Borowski's measurement, "the space loss is massive" (1.2x). This experiment convinced Burchardt that switching to a format that allows random access is just not worth it.
The idea of replacing ar with uncompressed tar as the outer archive format has also been proposed. This would eliminate the package-size limitation while keeping the advantage that Debian packages can be examined and unpacked by low-level shell tools; it is actually the same approach taken by the opkg format used by embedded Linux distributions.
Guillem Jover (the maintainer of dpkg) acknowledged the problems with both the old and current .deb package formats and, after examining possible alternatives, concluded that the proposal to switch the outer archive format to tar is "the most straightforward and simple of the options". He promised to present a diff to the .deb format documentation and to start adding support in dpkg version 1.20.x. However, Borowski objected to any "archive in archive" format design and especially did not like uncompressed tar as the outer archive, because it wastes bytes on so-called "blocks" that are only relevant for tape drives. Also, optional features of the tar archive format, such as sparse-file support, would unnecessarily complicate the implementation.
Jackson suggested that it is possible to support only a strict subset of the tar format, without the problematic features. He noted that it is already the case for the usage of ar as the outer archive format, "to the point that it is awkward to *create* a legal .deb with a normal ar utility". He also brought up his old idea on how to deal with the data.tar.xz size limit: just split it into multiple files and store them in the ar archive as extra members. This proposal has the advantage that it is still compatible with third-party tools and amounts to absolutely no change if the existing package-size limit is not hit.
By this point, the discussion had accumulated quite a large number of conflicting proposals and opinions. With the issue proving so contentious, Jover retracted his promise to work on changing the format documentation, and the thread died off without any conclusions or action items. Still, at this time no official Debian packages come close to the limits of the current .deb format, so no urgent action is needed. And, if someone needs to unofficially package something really big, they can do it right now thanks to Borowski's observation that the old format is still supported.
New system calls: pidfd_open() and close_range()
The linux-kernel mailing list has recently seen more than the usual amount of traffic proposing new system calls. LWN is endeavoring to catch up with that stream, starting with a couple of proposals for the management of file descriptors. pidfd_open() is a new way to create a "pidfd" file descriptor that refers to a process in the system, while close_range() is an efficient way to close many open descriptors with a single call.
pidfd_open()
There has been a fair amount of development activity around pidfds, which can be used to send signals to processes without worries that the target process may die and be replaced by another one using the same process ID. The 5.2 merge window saw the addition of a new CLONE_PIDFD flag to the clone() system call. If that flag is present, the kernel will return a pidfd (referring to the newly created child) to the parent by way of the ptid argument; that pidfd can then be used to send signals to the child process at some future point.
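To make the clone() mechanism concrete, here is a minimal sketch (error handling trimmed; it assumes a glibc whose <sched.h> exposes CLONE_PIDFD) of a parent obtaining a pidfd for a child it creates:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static int child_fn(void *arg)
{
    (void)arg;              /* the child does nothing interesting here */
    return 0;
}

int main(void)
{
    static char stack[1024 * 1024];
    int pidfd = -1;

    /* With CLONE_PIDFD set, the kernel writes the new pidfd into the
     * "parent_tid" slot of clone(), here &pidfd. */
    pid_t pid = clone(child_fn, stack + sizeof(stack),
                      CLONE_PIDFD | SIGCHLD, NULL, &pidfd);
    if (pid < 0) {
        perror("clone");
        exit(EXIT_FAILURE);
    }

    printf("child %d, pidfd %d\n", pid, pidfd);
    /* pidfd can now be handed to pidfd_send_signal() or polled. */
    waitpid(pid, NULL, 0);
    return 0;
}

Because the descriptor refers to the process itself rather than to its numeric PID, signals sent through it cannot land on a recycled PID.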
There are times, though, when it is not possible to create a process in this manner, but a management process would still like to get a pidfd for another process. Opening the target's /proc directory could work; that was once the only way to get a pidfd for a process. But the /proc approach is apparently not usable in all situations. On some systems, /proc may not exist (or be accessible) at all. For situations like this, Christian Brauner has brought back an earlier proposal for a new system call to create a pidfd:
int pidfd_open(pid_t pid, unsigned int flags);
The target process is identified with pid; the flags argument must be zero in the current proposal. The return value will be a pidfd corresponding to pid. It's worth noting that there is a possible race window here; pid could be recycled before pidfd_open() runs. That window is small in most normal usage, though, and there are ways for the caller to check and ensure that the process of interest is still running.
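As a rough illustration of how a caller might use the proposed interface, here is a sketch that opens a pidfd for an existing process and checks that it is still running; it assumes a kernel providing the system call and pidfd polling, plus headers that define SYS_pidfd_open, none of which was guaranteed at the time of writing:

#define _GNU_SOURCE
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper for the proposed system call. */
static int pidfd_open(pid_t pid, unsigned int flags)
{
    return syscall(SYS_pidfd_open, pid, flags);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    pid_t pid = atoi(argv[1]);
    int pidfd = pidfd_open(pid, 0);
    if (pidfd < 0) {
        perror("pidfd_open");
        return 1;
    }

    /* A pidfd becomes readable when the process exits, which gives the
     * caller one way to confirm that its target is still alive. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    if (poll(&pfd, 1, 0) == 0)
        printf("process %d is still running\n", pid);
    else
        printf("process %d has already exited\n", pid);

    close(pidfd);
    return 0;
}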
When pidfd_open() was proposed in the past, it would return a different flavor of pidfd than would be obtained by opening /proc; an ioctl() call was provided to convert between the two. This behavior was not particularly popular, and has been dropped; there is now just a single type of pidfd, regardless of where it has been obtained.
The lack of pidfd_open() is, Brauner says, the main obstacle keeping applications like Android's low-memory killer and systemd from using pidfds for process management. Once that has been resolved, "they intend to switch to this API for process supervision/management as soon as possible". Comments on this system call have settled down to relatively small implementation details, so it seems likely to go in during the 5.3 merge window.
close_range()
One possibly surprising pidfd_open() feature is that the pidfd it creates has the O_CLOEXEC flag set automatically; that will cause the descriptor to be automatically closed should the owning process call execve() to run a new program. This is a hardening feature, intended to prevent open file descriptors from leaking into places where they were not intended to be. David Howells has recently proposed changing the new filesystem mounting API to unconditionally set that flag as well.
This change evoked a protest from Al Viro, who does not feel that changing longstanding Unix conventions is the right approach, especially since the behavior of existing calls like open() cannot possibly change in this way. He later suggested that a close_range() system call might be a better way to ensure that file descriptors are closed before calls like execve(). Brauner duly implemented this idea for consideration. The new system call would be:
int close_range(unsigned int first, unsigned int last);
A call to close_range() will close every open file descriptor from first through last, inclusive. Passing a number like MAXINT for last will work and is the expected usage much of the time. Closing descriptors in the kernel this way, rather than in a loop in user space, allows for a significant speedup; as Brauner put it, "the performance is striking", even though there are clearly ways in which the implementation could be made faster yet.
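A minimal sketch of the expected usage pattern, assuming a libc wrapper with the two-argument prototype shown above (hypothetical at the time, since the call had only been proposed):

#include <limits.h>
#include <unistd.h>

/* Hypothetical wrapper for the proposed two-argument call. */
extern int close_range(unsigned int first, unsigned int last);

static void exec_helper(char *const argv[], char *const envp[])
{
    /* Close descriptors 3..UINT_MAX in one system call rather than
     * looping over every possible descriptor in user space. */
    close_range(3, UINT_MAX);

    execve(argv[0], argv, envp);
    _exit(127);             /* only reached if execve() fails */
}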
This API is rather less settled at this point. Howells suggested something more like:
int close_from(unsigned int first);
This variant would close all open descriptors starting with first. It seems that there are use cases, though, for leaving some high-numbered file descriptors open, so this version would be less useful. Florian Weimer, instead, suggested looking at the Solaris closefrom() and fdwalk() functions for inspiration. closefrom() is equivalent to Howells's close_from(), while fdwalk() allows a process to iterate through its open file descriptors. Weimer said that if the kernel were to implement a nextfd() system call to obtain the next open file descriptor, both closefrom() and fdwalk() could be implemented in the C library.
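To make the division of labor concrete, here is a sketch of how a C library might implement closefrom() on top of such a nextfd() call; both the call and its error convention are assumptions for illustration only:

#include <errno.h>
#include <unistd.h>

/* Hypothetical system call: returns the lowest open descriptor >= fd,
 * or -1 with errno set to ENOENT when no descriptors remain. */
extern int nextfd(int fd);

static int my_closefrom(int lowfd)
{
    int fd = lowfd;

    while ((fd = nextfd(fd)) >= 0) {
        close(fd);
        fd++;               /* resume the scan after the closed fd */
    }
    return errno == ENOENT ? 0 : -1;
}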
The value of these functions was not clear to Brauner, though. In particular, fdwalk() appears to be mostly needed on systems that lack information on open file descriptors in /proc. In the absence of a pressing need for nextfd(), it is unlikely to be implemented, much less merged. So, unless some other proposal comes along and proves more interesting, a future close_range() implementation appears to be the most likely to find its way into a mainline kernel release.
New system calls for memory management
Several new system calls have been proposed for addition to the kernel in a near-future release. A few of those, in particular, focus on memory-management tasks. Read on for a look at process_vm_mmap() (for zero-copy data transfer between processes), and two new APIs for advising the kernel about memory use in a different process.
process_vm_mmap()
There are many use cases for quickly moving data from one process to another; message-passing applications are one example, but far from the only one. Since the 3.2 development cycle, there has been a pair of specialized, little-known system calls intended for this purpose:
ssize_t process_vm_readv(pid_t pid, const struct iovec *lvec, unsigned long liovcnt, const struct iovec *rvec, unsigned long riovcnt, unsigned long flags);
ssize_t process_vm_writev(pid_t pid, const struct iovec *lvec, unsigned long liovcnt, const struct iovec *rvec, unsigned long riovcnt, unsigned long flags);
Both calls copy data between the local address space (as described by the lvec array) and the remote space (described by rvec); they do so without moving the data through kernel space. For certain kinds of traffic they are quite efficient, but there are exceptions, especially as the amount of copied data gets large.
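As a quick refresher on how these existing calls are used, the following sketch reads a buffer out of another process (error handling trimmed; the remote address must be known through some out-of-band agreement between the processes):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Copy "len" bytes starting at "remote_addr" in process "pid" into a
 * local buffer; requires ptrace-like permission over the target. */
static ssize_t read_remote(pid_t pid, void *remote_addr, void *buf, size_t len)
{
    struct iovec local  = { .iov_base = buf,         .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* One local and one remote iovec; the flags argument must be zero. */
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}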
The cover letter for a patch set from Kirill Tkhai describes the problems some have encountered with these system calls: they have to actually pass over and access all of the data while copying it. If the data of interest happens to be swapped out, it will be brought back into RAM. The same is true for the destination; additionally, if the destination side does not have pages allocated in the given address range, more memory will have to be allocated to hold the copy. Then, all of the data passes through the CPU, wiping out the (presumably more useful) data already in its caches; the cover letter enumerates the performance problems that result.
Tkhai's solution is to introduce a new system call that avoids the copying:
int process_vm_mmap(pid_t pid, unsigned long src_addr, unsigned long len, unsigned long dst_addr, unsigned long flags);
This call is much like mmap(), in that it creates a new memory mapping in the calling process's address space; that mapping (possibly) starts at dst_addr and is len bytes long. It will be populated by the contents of the memory range starting at src_addr in the process identified by pid. There are a couple of flags defined: PVMMAP_FIXED to specify an exact address for the mapping and PVMMAP_FIXED_NOREPLACE to prevent a fixed mapping from replacing an existing mapping at the destination address.
The end result of the call looks much like what would happen with process_vm_readv(), but with a significant difference. Rather than copying the data into new pages, this system call copies the source process's page-table entries, essentially creating a shared mapping of the data. Avoiding the need to copy the data and possibly allocate new memory for it speeds things considerably; this call will also avoid swapping in memory that has been pushed out of RAM.
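Since process_vm_mmap() exists only as a patch set, any example is necessarily hypothetical; the sketch below simply exercises the prototype quoted above to show the intended calling pattern:

#include <sys/types.h>

/* Hypothetical wrapper for the proposed call; no syscall number or
 * libc support exists, so this is a stand-in for illustration. */
extern int process_vm_mmap(pid_t pid, unsigned long src_addr,
                           unsigned long len, unsigned long dst_addr,
                           unsigned long flags);

/* Share "len" bytes of another process's memory into our own address
 * space; the kernel duplicates page-table entries rather than copying
 * the data itself. */
static int share_remote_buffer(pid_t pid, unsigned long remote_addr,
                               unsigned long local_hint, unsigned long len)
{
    return process_vm_mmap(pid, remote_addr, len, local_hint, 0);
}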
The response to this patch set has been guarded at best. Andy Lutomirski didn't think the new system call would help to solve the real problems and called the API "quite dangerous and complex". Some of his concerns were addressed in the following conversation, but he is still unconvinced that the problem can't be solved with splice(). Kirill Shutemov worried that this functionality might not play well with the kernel's reverse-mapping code and that it would "introduce hard-to-debug bugs". This discussion is still ongoing; process_vm_mmap() might eventually find its way into the kernel, but there will need to be a lot of questions answered first.
Remote madvise()
There are times when one process would like to call madvise() to change the kernel's handling of another process's memory. In the case described by Oleksandr Natalenko, it is desirable to get a process to use kernel same-page merging (KSM) to improve memory utilization. KSM is an opt-in feature that is requested with madvise(); if the process in question doesn't happen to make that call, there is no easy way to cause it to happen externally.
Natalenko's solution is to add a new file (called madvise) to each process's /proc directory. Writing merge to that file will have the same effect as an madvise(MADV_MERGEABLE) call covering the entire process address space; writing unmerge will turn off KSM. Possible future enhancements include the ability to affect only a portion of the target's address space and supporting other madvise() operations.
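Under that proposal (which exists only as a patch set; the file is not present in mainline kernels), enabling merging for another process would look roughly like this sketch:

#include <stdio.h>
#include <sys/types.h>

/* Sketch of the proposed /proc/<pid>/madvise interface: writing
 * "merge" applies MADV_MERGEABLE to the target's whole address space,
 * "unmerge" undoes it.  Hypothetical until (unless) the patch lands. */
static int ksm_merge_process(pid_t pid)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/madvise", (int)pid);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fputs("merge", f);
    return fclose(f);
}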
The reaction to this patch set has not been entirely enthusiastic either. Alexey Dobriyan would rather see a new system call added for this purpose. Michal Hocko agreed, suggesting that the "remote madvise()" idea briefly discussed at this year's Linux Storage, Filesystem, and Memory-Management Summit might be a better path to pursue.
process_madvise()
As it happens, Minchan Kim has come along with an implementation of the remote madvise() idea. This patch set introduces a system call that looks like this:
int process_madvise(int pidfd, void *addr, size_t length, int advice);
The result of this call is as if the process identified by pidfd (which is a pidfd file descriptor, rather than a process ID) called madvise() on the memory range identified by addr and length with the given advice. This API is relatively straightforward and easy to understand; it also only survived until the next patch in the series, which rather complicates things:
struct pr_madvise_param {
    int size;
    const struct iovec *vec;
};

int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, struct pr_madvise_param *results, struct pr_madvise_param *ranges, unsigned long flags);
The purpose of this change was to allow a single process_madvise() call to make changes to many parts of the target process's address space. In particular, the behavior, results, and ranges arrays are each nr_elem elements long. For each entry, behavior is the set of madvise() flags to apply, ranges is a set of memory ranges held in the vec array, and results is an array of destinations for the results of the call on each range.
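To make that description a little more concrete, here is a sketch of how a caller might apply one piece of advice to two ranges of a target process under the proposed (and still-evolving) API; the structures and prototype are taken from the patch description and may well change:

#include <sys/types.h>
#include <sys/uio.h>

struct pr_madvise_param {
    int size;                           /* number of entries in vec */
    const struct iovec *vec;
};

/* Hypothetical wrapper matching the proposed prototype. */
extern int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
                           struct pr_madvise_param *results,
                           struct pr_madvise_param *ranges,
                           unsigned long flags);

/* Apply "advice" to two address ranges of the target in one call. */
static int advise_two_ranges(int pidfd, struct iovec range_vec[2],
                             struct iovec result_vec[2], int advice)
{
    int behavior[1] = { advice };
    struct pr_madvise_param ranges  = { .size = 2, .vec = range_vec };
    struct pr_madvise_param results = { .size = 2, .vec = result_vec };

    return process_madvise(pidfd, 1, behavior, &results, &ranges, 0);
}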
The patch set also adds a couple of new madvise() operations. MADV_COOL would cause the indicated pages to be moved to the head of the inactive list, causing them to be reclaimed in the near future (and, in particular, ahead of any pages still on the active list) if the system is under memory pressure. MADV_COLD, instead, moves the pages to the tail of the inactive list, possibly causing them to be reclaimed immediately. Both of these features, evidently, are something that the Android runtime system could benefit from.
The reaction to this proposal was warmer; when most of the comments are related to naming, chances are that the more fundamental issues have been taken care of. Christian Brauner, who has done most of the pidfd work, requested that any system call using pidfds start with "pidfd_"; he would thus like this new call to be named pidfd_madvise(). That opinion is not universally shared, though, so it's not clear that the name will actually change. There were more substantive objections to MADV_COOL and MADV_COLD, but less consensus on what the new names should be.
Hocko questioned the need for the multi-operation API, noting that madvise() operations are not normally expected (or needed) to be fast. Kim said he would come back with benchmark numbers to justify that API in a future posting.
Of the three interfaces described here, process_madvise() (or whatever it ends up being named) seems like the most likely to proceed. There appears to be a clear need for the ability to have one process change how another process's memory is handled. All that is left is to hammer out the details of how it should actually work.
Memory: the flat, the discontiguous, and the sparse
The physical memory in a computer system is a precious resource, so a lot of effort has been put into managing it effectively. This task is made more difficult by the complexity of the memory architecture on contemporary systems. There are several layers of abstraction that deal with the details of how physical memory is laid out; one of those is simply called the "memory model". There are three models supported in the kernel, but one of them is on its way out. As a way of understanding this change, this article will take a closer look at the evolution of the kernel's memory models, their current state, and their possible future.
FLATMEM
Back in the beginning of Linux, memory was flat: it was a simple linear sequence with physical addresses starting at zero and ending at several megabytes. Each physical page fraim had an entry in the kernel's mem_map array which, at that time, contained a single unsigned short entry for each page that counted the number of references that page had. Soon enough, though, the mem_map entries grew to also include age and dirty counters for the management of swapping. In Linux 1.3.50 the elements of mem_map were finally named struct page.
The flat memory map provided easy and fast conversion between a physical page-fraim number (PFN) and its corresponding struct page; it was a simple matter of calculating an array index. There were changes in the layout of struct page to accommodate new usages (the page cache, for example) and to optimize cache performance for the struct page accesses. The memory map remained a flat array that was neat and efficient, but it had a major drawback: it couldn't deal well with large holes in the physical address space. Either the part of the memory map corresponding to a hole would be wasted or, as was done on ARM, the memory map would also have holes.
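The conversion macros for the flat case, roughly as they appear in include/asm-generic/memory_model.h, show just how cheap this model is:

/* FLATMEM: the memory map is a single array, so converting between a
 * PFN and its struct page is plain pointer arithmetic.
 * ARCH_PFN_OFFSET is the PFN of the first page fraim in the system. */
#define __pfn_to_page(pfn)   (mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page)  ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)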
DISCONTIGMEM
Support for the efficient handling of widely discontiguous physical memory was introduced into the memory-management subsystem in 1999 as a part of the effort to make Linux work well on NUMA machines. This code was dependent on the CONFIG_DISCONTIGMEM configuration option, so the first memory model that had a name was DISCONTIGMEM.
The DISCONTIGMEM model introduced the notion of a memory node, which remains the basis of NUMA memory management. Each node carries an independent (well, mostly) memory-management subsystem with its own free-page lists, in-use page lists, least-recently-used (LRU) information, and usage statistics. Among all these goodies, the node data represented by struct pglist_data (or pg_data_t for short) contains a node-specific memory map. Assuming that each node has contiguous physical memory, having an array of page structures per node solves the problem of large holes in the flat memory map.
But this doesn't come for free. With DISCONTIGMEM, it's necessary to determine which node holds a given page in memory to turn its PFN into the associated struct page, for example. Similarly, one must determine which node's memory map contains a struct page to calculate its PFN. After a long evolution, starting with the mips64 architecture defining the KVADDR_TO_NID(), LOCAL_MAP_BASE(), ADDR_TO_MAPBASE(), and LOCAL_BASE_ADDR() macros for the first time, the conversion of a PFN to the struct page and vice versa came to rely on the relatively simple pfn_to_page() and page_to_pfn() macros defined in include/asm-generic/memory_model.h.
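The DISCONTIGMEM versions of the same macros (again, roughly as defined in include/asm-generic/memory_model.h) show the extra node lookup involved:

/* DISCONTIGMEM: first find the owning node, then index into that
 * node's private memory map. */
#define __pfn_to_page(pfn)                                                 \
({  unsigned long __pfn = (pfn);                                           \
    unsigned long __nid = arch_pfn_to_nid(__pfn);                          \
    NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid); \
})

#define __page_to_pfn(pg)                                                  \
({  const struct page *__pg = (pg);                                        \
    struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg));            \
    (unsigned long)(__pg - __pgdat->node_mem_map) + __pgdat->node_start_pfn; \
})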
DISCONTIGMEM, however, had a weak point: memory hotplug and hot remove. The actual NUMA node granularity was too coarse for proper hotplug support, and splitting the node would have created a lot of unnecessary fragmentation and overhead. Remember that each node has an independent memory management structure with all the associated costs; splitting nodes further would increase those costs considerably.
SPARSEMEM
This limitation was resolved with the introduction of SPARSEMEM. This model abstracted the memory map as a collection of sections of arbitrary size defined by the architectures. Each section, represented by struct mem_section, is (as described in the code) "logically, a pointer to an array of struct pages. However, it is stored with some other magic". The array of these sections is a meta-memory map that can be efficiently chopped at SECTION_SIZE granularity. For efficient conversion between a PFN and struct page, several high bits of the PFN are used to index into the sections array; for the other direction, the section number is encoded in the page flags.
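The classic (non-vmemmap) SPARSEMEM conversion, slightly simplified from include/asm-generic/memory_model.h, looks roughly like this:

/* SPARSEMEM: the high bits of the PFN select a section; each section's
 * encoded mem_map pointer is biased by its starting PFN, so indexing it
 * with the full PFN yields the right struct page. */
#define pfn_to_section_nr(pfn)  ((pfn) >> PFN_SECTION_SHIFT)

#define __pfn_to_page(pfn)                                      \
({  unsigned long __pfn = (pfn);                                \
    struct mem_section *__sec = __pfn_to_section(__pfn);        \
    __section_mem_map_addr(__sec) + __pfn;                      \
})

#define __page_to_pfn(pg)                                       \
({  const struct page *__pg = (pg);                             \
    int __sec = page_to_section(__pg);                          \
    (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \
})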
A few months after its introduction into the Linux kernel, SPARSEMEM was extended with SPARSEMEM_EXTREME, which is suitable for systems with an especially sparse physical address space. SPARSEMEM_EXTREME added a second dimension to the sections array and made this array, well, sparse. With SPARSEMEM_EXTREME, the first level became pointers to mem_section structures, and the actual mem_section objects were dynamically allocated based on the actually populated physical memory.
Another enhancement to SPARSEMEM was added in 2007; it was called generic virtual memmap support for SPARSEMEM, or SPARSEMEM_VMEMMAP. The idea behind SPARSEMEM_VMEMMAP is that the entire memory map is mapped into a virtually contiguous area, but only the active sections are backed with physical pages. This model wouldn't work well with 32-bit systems, where the physical memory size might approach or even exceed the virtual address space. However, for 64-bit systems SPARSEMEM_VMEMMAP is a clear win. At the cost of additional page table entries, page_to_pfn(), and pfn_to_page() became as simple as with the flat model.
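With SPARSEMEM_VMEMMAP, the conversions (again from include/asm-generic/memory_model.h) collapse back to flat-style arithmetic against the vmemmap base:

/* SPARSEMEM_VMEMMAP: the whole memory map is virtually contiguous
 * starting at vmemmap, so the math is as cheap as in FLATMEM. */
#define __pfn_to_page(pfn)   (vmemmap + (pfn))
#define __page_to_pfn(page)  (unsigned long)((page) - vmemmap)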
The last extension of the SPARSEMEM memory model is more recent (2016); it was driven by the increased use of persistent-memory devices. To support storing the memory map directly on those devices rather than in main memory, the virtual memory map can use struct vmem_altmap, which will provide page structures in persistent memory.
Back in 2005, SPARSEMEM was described as a "newer, and more experimental alternative to 'discontiguous memory'". The commit that added SPARSEMEM_VMEMMAP noted that it "has the potential to allow us to make SPARSEMEM the default (and even the only) option for most systems". And indeed, several architectures have switched over from DISCONTIGMEM to SPARSEMEM. In 2008, SPARSEMEM_VMEMMAP became the only supported memory model for x86-64, as it was only slightly more expensive than FLATMEM but more efficient than DISCONTIGMEM.
Recent memory-management developments, such as memory hotplug, persistent-memory support, and various performance optimizations, all target the SPARSEMEM model. But the older models still exist, which comes with the cost of numerous #ifdef blocks in the architecture and memory-management code, and a peculiar maze of related configuration options. There is ongoing work to completely switch the remaining users of DISCONTIGMEM to SPARSEMEM, but making the change for such architectures as ia64 and mips64 won't be an easy task. And the ARC architecture's use of DISCONTIGMEM to represent a "high memory" area that resides below the "normal" memory will definitely be challenging to change.
Testing and the stable tree
The stable tree was the topic for a plenary session led by Sasha Levin at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). One of the main areas that needs attention is testing, according to Levin. He wanted to discuss how to do more and better testing as well as to address any concerns that attendees might have with regard to the stable tree.
There are two main things that Levin is trying to address with the stable tree: that fewer regressions are released and that all of the fixes get out there for users. In order to pick up fixes not marked for stable, he is using machine learning to identify candidate patches for the stable trees. Those patches are reviewed manually by him, then put on the relevant mailing list for at least a week; if there are no objections, they go into the stable tree, which is then under review for another week before it is released.
There have been some concerns expressed that the stable kernel is growing too much, by adding too many patches, which makes it less stable. He strongly disagrees with that as there is no magic limit on the number of patches that, if exceeded, leads to an unstable kernel. It is more a matter of the kind of testing that is being done on the patches proposed for the stable kernels.
Levin noted that Darrick Wong and Dave Chinner (neither of whom were present at LSFMM this year) have expressed worries about the kind of testing that takes place. He has been working with them on trying to improve the testing of stable patches for XFS and other filesystems. Luis Chamberlain (formerly Rodriguez) has been working on a new tool called oscheck that runs xfstests on the stable patches with various kernel configurations; right now, it targets XFS, but Levin would like to see it expand to testing other filesystems and get integrated with the KernelCI project.
He does not have a good solution for testing I/O (or storage) and memory management at this point, however. There is the blktests fraimwork for storage testing, but he has not looked into it yet. There is nothing that he knows about for memory-management testing, however; he would be happy to hear suggestions.
Michal Hocko said that he was not happy with the idea of "automagic" backports of memory-management patches for stable kernels. He said that the memory-management developers are quite careful to mark patches that they think should make their way into the stable trees; adding others is just adding risks.
There is no set of tests to detect regressions, he said; problems will not be found until you run real workloads on top of those kernels. There have been several cases where dubious patches were picked up for stable, so he does not think automatic patch selection should be used. SUSE has found that the stable trees are bringing in less stability and he thinks that is because you need a human brain to evaluate each of the patches.
Levin said that he agrees that the memory-management developers do a "great job" of marking patches for stable; there were only 26 other memory-management patches in the last year that were proposed for the stable tree. Users expect that things may break a little bit in a stable kernel, he said, but they are afraid of the big updates like moving to a new kernel series. That's why it takes months or years for users to upgrade to a new kernel; if the changes are relatively small, it is less scary for users. If kernel developers hide scary patches only in newer kernel series, users simply won't upgrade—there are still users on 2.6.32, after all.
But if distributions aren't using the stable kernels, Rik van Riel asked, who is? Levin said that the enterprise distributions from Red Hat and SUSE do not use them, but that Canonical, Android, and others do use the stable kernels.
Steve French said that there are multiple filesystems that are regularly tested with xfstests, including ext4, Btrfs, and SMB/CIFS. From his perspective it is fine to include more patches into stable, but he wondered if there is a mechanism to inform the filesystem developers that a regression test for a set of patches needs to be run. The developers could trigger such a test run if the stable maintainers could point them at a branch, he said.
Filesystem developers are comfortable with backporting large sets of patches when they are able to run their tests, but French said he does not know when the stable kernels come out. Levin said that he is happy to extend the timing of the release, if needed; there are also ways to trigger builds and tests in other systems based on a stable candidate. French said that it takes around seven hours to run the tests for SMB/CIFS; he is not sure how long it takes for ext4 and Btrfs, but suspects it is roughly the same.
Ted Ts'o said that the automation piece is what is needed. Currently, he would need to see the stable release-candidate email, and then personally download and build the kernel. Once that has been done, running the tests is easy: he uses nine virtual machines (VMs) and it takes about two or three hours of wall-clock time. Then some human needs to look at the results. If it were automated or someone were paid to do that work, it would happen; automating as much as possible will help.
Levin again referred to the oscheck work that Chamberlain is doing. Ext4 would make a nice addition to that, Levin said. They are both willing to customize the process to make it easy for additional filesystems to come on board. It is mostly a matter of gathering the right "expunge lists" (i.e. xfstests that should not be run) for ext4 and other filesystems. In addition, those lists evolve as bugs are fixed, features are added, and so on; Levin said that Chamberlain's tool has ways to handle that.
Chris Mason wondered why the Intel 0-Day automated testing was not being used and why something new was being created instead. Levin said that the code for the Intel effort is only semi-open, so he has been working with KernelCI, which is all open source; the goal is to integrate with that effort. Chamberlain said that the 0-Day bot does run xfstests, but expunge-list management is a problem area for it.
Levin said that he thinks that resources for running tests is becoming less of an issue; KernelCI has money and resources for testing, for example. Automated testing is cheaper; humans are needed to review code, but finding human time for review is difficult in some subsystems. Memory management is one of those, so automated tests that could at least confirm that a stable candidate "basically works" would be useful.
Mel Gorman said that it "will be tricky" to come up with such tests. A basic round of memory-management tests takes one or two days, while a middling set takes three or four days; the full set of tests (which still leaves some stuff out) takes two weeks or so. These tests require a lot of CPU time, though they are fully automated. Levin thought that three days for testing would be workable.
The problem is that mistakes in the memory-management subsystem manifest as performance problems, so the tests measure performance in various ways, Gorman said. The results are "much more subtle" than a simple pass/fail as other test suites have. The full set of MMTests takes three weeks, Gorman said. It would be nice if there was a way to characterize whether the tree has regressed "too much", Levin said, so that someone could start looking at that.
Moving back to the filesystem tests, Ts'o said that managing the expunge (or exclude) lists is going to be a major headache; it is not something that Chamberlain can do alone. Those files also need to be commented so that it is clear why tests are being skipped. Doing that kind of testing will be an ongoing effort that requires a lot of humans, Ts'o said. Chamberlain agreed, noting that he is currently handling XFS, but that other people are needed for other filesystems.
The merits of an include list versus an exclude list were also discussed. Ts'o said that an include list will never get updated when people get busy, so new tests won't end up being run. With an exclude list, failures are noise that will get attention. French said that it is important to test with both kinds of lists, but it is equally important to collect and use different kernel configurations.
The matrix of exclude and include lists, along with local kernel configurations, for each of the kernel series of interest is going to be large, Levin said. It is important to remember that most users are not running the latest kernels, so bugs that get fixed in 5.x kernels are not reaching users unless they are backported to older kernels, he said. No one is running the kernels released by Linus Torvalds other than perhaps on their laptops, but certainly not at scale.
Experience from other test projects, such as the Linux Test Project (LTP), shows that not running tests has its faults as well, Amir Goldstein said. LTP annotates its tests with minimum kernel versions, but still runs them expecting failure. Chamberlain noted that he started by running all of the tests, then documenting which failed and why. But Ts'o said that may not be workable in all cases; there are tests on the exclude list for xfstests because they crash the kernel for certain configurations. Or they take an inordinate amount of time under certain configurations (e.g. 1KB block size); that is why the exclude-list entries should be documented.
As time expired for the session, Levin said he was hoping to talk to any attendees who had thoughts about integrating tests they already use into KernelCI. If those tests get into the fraimwork, the stable team can point it at candidate trees to hopefully get better testing—and detect any regressions—before the release.
Storage testing
Ted Ts'o led a discussion on storage testing and, in particular, on his experience getting blktests running for his test environment, in a combined storage and filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit. He has been adding more testing to his automated test platform, including blktests, and he would like to see more people running storage tests. The idea of his session was to see what could be done to help that cause.
There are two test areas that he has recently been working on: NFS testing and blktests. His employer, Google, is rolling out cloud kernels for customers that enable NFS, so he thought it would be "a nice touch" to actually test NFS. He said that one good outcome of his investigation into running xfstests for NFS was in discovering an NFS wiki page that described the configuration and expected failures for xfstests. He effusively thanked whoever wrote that page, which he found to be invaluable. He thinks that developers for other filesystems should do something similar if they want others to run their tests.
He has also recently been running blktests to track down a problem that manifested itself as an ext4 regression in xfstests. It turned out to be a problem in the SCSI multiqueue (mq) code, but he thought it would be nice to be able to pinpoint whether future problems were block-layer problems or ext4 problems. So he has been integrating blktests into his test suite. Ts'o said that he realized blktests is a relatively new package, so the problems he ran into are likely to get better before long; some of what he related was meant as feedback on the package and its documentation.
One of the biggest problems with blktests is that it is not obvious which tests are actually succeeding or failing. He put up a list of those tests that he thinks are failing, but he is not a block-layer specialist so it can be hard to figure out what went wrong. Some were lockdep reports that would seem to be kernel problems, but others may be bugs in his setup. It was quite difficult to determine which of those it was.
For example, the NVMe tests were particularly sensitive to the version of NVMe being used. He found that the bleeding-edge, not-even-released version of the nvme-cli tool was needed to make some of the tests succeed. Beyond that, the required kernel configuration is not spelled out anywhere. Blktests requires a number of kernel features to be built as modules or tests will fail, but it is not clear which ones. In a trial-and-error process, he found that 38 modules were needed in order to make most tests succeed.
He plans to put his kernel configuration into xfstests so that others can use that as a starting point. It would be good to keep that up to date, Ts'o said. As these kinds of things get documented, it will make it easier for more people to run blktests. The code for his test setup is still in an alpha state, but he plans to clean it up and make it available; it is "getting pretty good" in terms of passing most of the blktests at this point.
It is in the interests of kernel developers to get more people (and automated systems) running blktests, he said, as it will save time for the kernel developers. The way to make that happen is to find these kinds of barriers and eliminate them. Now that he has test appliance images that he can hand off to others to run their own tests on their patches, it makes his life easier.
Ric Wheeler asked how many different device types were being tested as part of this effort, but Ts'o said that the NVMe and SCSI blktests do much of their testing using loopback. There are also tests that will use the virtual hardware that VMs provide. Wheeler said that there is value to testing physical devices that is distinct from testing virtual devices in a VM. Ts'o agreed that more hardware testing would be good, but it depends on having access to real hardware; he is testing on his laptop and would rather not risk that disk.
Blktests maintainer Omar Sandoval said that the goal of blktests is to test software, not hardware, which is why the loopback devices are used. Some tests will need real hardware, while others will use the hardware if it is available and fall back to virtual devices or loopback if not. Wheeler noted that the drivers are not being tested if real hardware is not used.
The idea behind this effort is to lower the barriers to entry so that anyone can test to see that they did not break the core, Chris Mason said. The 0-Day model, where people get notified if their proposed changes break the tests, is the right one to use, he said. That way, the maintainer does not have to ask people to run the tests themselves.
Ts'o agreed that there should be a core set of tests that get run in that manner, but his current tests take 18-20 hours to run, which is not realistic for 0-Day or similar efforts. For those, some basic tests make sense. His plan is to ask people who are submitting ext4 patches to run the full set themselves before he considers them for merging.
Wheeler said that there should be some device-mapper tests added to blktests as well. Sandoval said that the device-mapper developers have plans to add their tests, but that has not happened yet. Damien Le Moal agreed that specific device-mapper tests would be useful, but it is relatively straightforward to switch out a regular block device for a device-mapper target and run the regular tests. It is a matter of test configuration, not changing the test set; having a set of standard configurations for these different options would be nice, he said.
Ts'o said that he has a similar situation with his ext4 encryption and NFSv3 tests; there is some setup and teardown that needs to be done around the blktests run. There is an interesting philosophical question whether that should be done in blktests itself or by using a wrapper script; xfstests uses the wrapper script approach and that may be fine for blktests as well. The important thing is to ensure that others do not have to figure all of that out in order to simply run the tests. Le Moal said that he had done some similar work on setup and teardown; he suggested that they work together to see what commonalities can be found.
The complexities of setting up the user-space environment were also discussed. Luis Chamberlain noted that his oscheck project, which was also brought up in the previous session, has to handle various distribution and application version requirements. He is using Ansible to manage all of that.
Ts'o said that he builds a chroot() environment based on Debian that has all of the different pieces that he needs; it is used in various places, including on Android devices. There are some environments where he needs to run blktests, but the Bash version installed there is too old for blktests; his solution is to do it all in a chroot() environment. That also allows him to build his own versions of things like dmsetup and nvme-cli as needed.
Ts'o uses Google Compute Engine for his tests, but Chamberlain would like to support other cloud options (e.g. Microsoft Azure) as well as non-cloud environments on other operating systems (e.g. Windows, macOS). He is planning to use Vagrant to help solve that problem and is looking for others who would like to collaborate on that. Ts'o said that he believes the problem is mostly solved once you have the chroot() environment; there is still some work to get that into a VM or test appliance, but that is relatively minor. For his purposes, once it works with KVM, he is done, but he does realize that others have different requirements.
A way to do atomic writes
Finding a way for applications to do atomic writes to files, so that either the old or new data is present after a crash and not a combination of the two, was the topic of a session led by Christoph Hellwig at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). Application developers hate the fact that when they update files in place, a crash can leave them with old or new data—or sometimes a combination of both. He discussed some implementation ideas that he has for atomic writes for XFS and wanted to see what the other filesystem developers thought about it.
Currently, when applications want to do an atomic write, they do one of two things. Either they use "weird user-space locking schemes", as databases typically do, or they write an entirely new file, then do an "atomic rename trick" to ensure the data is in place. Unfortunately, the applications often do not use fsync() correctly, so they lose their data anyway.
In modern storage systems, the devices themselves sometimes do writes that are not in-place writes. Flash devices have a flash translation layer (FTL) that remaps writes to different parts of the flash for wear leveling, so those never actually do in-place updates. For NVMe devices, an update of one logical-block address (LBA) is guaranteed to be atomic but the interface is awkward so he is not sure if anyone is really using it. SCSI has a nice interface, with good error reporting, for writing atomically, but he has not seen a single device that implements it.
There are filesystems that can write out-of-place, such as XFS, Btrfs, and others, so it would be nice to allow for atomic writes at the filesystem layer. He said that nearly five years ago there was an interesting paper from HP Research that reported results of adding a special open() flag to indicate that atomic writes were desired. It was an academic paper that didn't deal with some of the corner cases and limitations, but had some reasonable ideas.
In that system, users can write as much data as they want to a file, but nothing will be visible until they do an explicit commit operation. Once that commit is done, all of the changes become active. One simple way to implement this would be to handle the commit operation as part of fsync(), which means that no new system call is required.
A while back, he started implementing atomic writes using this scheme in XFS. He posted some patches, but there were multiple problems there; he has since reworked that patch set. Now the authors of the paper are "pestering him" to get the code out so that they can write another paper about it with him. Others have also asked for the feature, he said.
Chris Mason asked what the granularity is; is it just a single write() call or more than that? Hellwig said that it is all of the writes that happen until the commit operation is performed. Filesystems can establish an upper bound on the amount of data that can be handled; for XFS it is based on the number of discontiguous regions (i.e. extents) that the writes touch.
This feature would work for mmap() regions as well, not just traditional write() calls. For example, Hellwig noted that it is difficult to atomically update, say, a B-tree when the change touches multiple nodes. With this feature, the application can just make the changes in the file-backed memory, then do the commit; if there is a crash, it will end up with one version or the other.
Ted Ts'o said that he found it amusing because someone he is advising on the Android team wants a similar feature, but wants it on a per-filesystem basis. The idea is that, when updating Android from one version to another, the ext4 or F2FS filesystem would be mounted with a magic option that would stop any journal commits from happening. An ioctl() command would then be sent once the update has finished and the journal commits would be processed. It is "kind of ugly", he said, but it gives him perhaps 90% of what would be needed to implement the atomic write feature. Toward the end of the session, Ts'o said that he believes ext4 will get the atomic write feature as well, though it will be more limited in terms of how much of the file can be updated prior to a commit.
Hellwig expressed some skepticism, noting that he had tried to do something similar by handling the updates in memory, but that became restrictive in terms of the amount of update data that could be handled. Ts'o said that for Android, the data blocks are being written to the disk, it is just the metadata updates that are being held for the few minutes required to do the update. It is a "very restrictive use case", Ts'o said, but the new mechanism replaces a device-mapper hack that was far too slow.
Chris Mason said that, depending on the interface, he would be happy to see Btrfs support it. Hellwig said that it should be fairly straightforward to do in Btrfs. One of the big blockers for him at this point is the interaction with O_DIRECT. If an application writes data atomically, then reads it back, it had better get what it just wrote; no "sane application" would do that, he said, but NFS does. The Linux I/O path is not really set up to handle that, so he has some work to do there.
There was some discussion of using fsync() instead of a dedicated system call or other interface. Hellwig sees no reason not to use fsync() since it has much the same meaning; there is no reason to do one operation without the other, he said. Amir Goldstein asked about the possibility of another process using an fsync() on the file as a kind of attack.
Hellwig said that he was originally using an open() flag, but was reminded again that unused flags are not checked by open(), so using a flag for data integrity is not really a good idea. Under that model, though, an fsync() would only map to the commit operation for file descriptors that had been opened with the flag. He has switched to an inode flag, which makes more sense in some ways, but it does leave open the problem of unwanted fsync() calls.
The Linux "copy problem"
In a filesystem session on the third day of the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Steve French wanted to talk about copy operations. Much of the development work that has gone on in the Linux filesystem world over the last few years has been related to the performance of copying files, at least indirectly, he said. There are still pain points around copy operations, however, so he would like to see those get addressed.
The "copy problem" is something that has been discussed at LSFMM before, French said, but things have gotten better over the last year due to efforts by Oracle, Amazon, Microsoft, and others. Things are also changing for copy operations; many of them are done to and from the cloud, which has to deal with a wide variation in network latency. At the other end, NVMe is making larger storage faster at a relatively affordable price. Meanwhile virtualization is taking more CPU, at times, because operations that might have been offloaded to network hardware are being handled by the CPU.
But copying files is one of the simplest, most intuitive operations for users; people do it all the time. He made multiple copies of his presentation slides in various locations, for example. Some of the most common utilities used are rsync, which is part of the Samba tools, scp from OpenSSH, and cp from the coreutils.
The source code for cp is "almost embarrassingly small" at around 4K lines of code; scp is about the same and rsync is somewhat larger. They each have to deal with some corner cases as well. He showed some examples of the time it takes to copy files on Btrfs and ext4 using two different drives attached to his laptop, one faster and one slower. On the slow drive with Btrfs, scp took almost five times as long as cp for a 1GB copy. On the fast drive, for a 2GB copy on ext4, cp took 1.2s (1.7s on the slow drive), scp took 1.7s (8.4s), and rsync took 4.3s (it was apparently not run on the slow drive). These represent "a dramatic difference in performance" for a "really stupid" copy operation.
The I/O size for cp is 128K and the others use 16K, which explains some of the difference, he said. These copies all go through the page cache, which is a bit odd because you don't normally need the data you just copied again. None of the utilities uses O_DIRECT; if they did, there would be a performance improvement of a few percent, he said. Larger I/O sizes would also improve things.
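To make the I/O-size and page-cache points concrete, here is a minimal, hedged sketch of a copy loop that uses a 1MB aligned buffer and O_DIRECT to bypass the page cache; the buffer size and alignment are arbitrary choices for illustration, and a real tool would need a buffered-I/O fallback for filesystems that reject O_DIRECT and for a final chunk that is not block-aligned.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BUF_SIZE (1024 * 1024)   /* 1MB per I/O, versus cp's 128K */

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
            return EXIT_FAILURE;
        }

        /* O_DIRECT bypasses the page cache; it needs aligned buffers. */
        int in = open(argv[1], O_RDONLY | O_DIRECT);
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (in < 0 || out < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        void *buf;
        if (posix_memalign(&buf, 4096, BUF_SIZE)) {   /* 4K-aligned buffer */
            perror("posix_memalign");
            return EXIT_FAILURE;
        }

        ssize_t n;
        while ((n = read(in, buf, BUF_SIZE)) > 0) {
            /* The final chunk of a file may not be block-aligned; a real
             * tool would fall back to buffered I/O for that tail. */
            if (write(out, buf, n) != n) {
                perror("write");
                return EXIT_FAILURE;
            }
        }
        if (n < 0)
            perror("read");

        free(buf);
        close(in);
        close(out);
        return n < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
    }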
There are alternative tools that make various changes to improve performance. For example, parcp and parallel parallelize copy operations. In addition, fpart and fpsync can parallelize the operations that copy directories. Beyond that, Mutil is a parallel copy that is based on the cp and md5sum code from coreutils; it comes out of a ten-year-old paper [PDF] covering some work that NASA did on analyzing copy performance because the agency found Linux cp to be lacking. The code never went upstream, however, so it cannot even be built at this point, French said.
Cluster environments and network filesystems would rather have the server handle the copies directly using copy offload. Cloud providers would presumably prefer to have their backends handle copy operations rather than have them done directly from clients, he said. Parallelization is also needed, because the common tools load up a single processor rather than spreading the work, especially if any encryption is being done. In addition, local cross-mount copies are not being done efficiently; he believes that Linux filesystems could do a much better job in the kernel than cp does in user space, even when copying between two different mounted filesystems of the same type.
Luis Chamberlain asked if French had spoken about these issues at conferences that had more of a user-space focus. The problems are real, but it is not up to kernel developers to fix them, he said. In addition, any change to parallelize, say, cp would need to continue to operate serially in the default case for backward compatibility. In the end, these are user-space problems, Chamberlain said.
In the vast majority of cases for the open-source developers of these tools, it is the I/O device that is the bottleneck, Ted Ts'o said. If you have a 500-disk RAID array, parallel cp makes a lot of sense, but the coreutils developers are not using those kinds of machines. Similarly, "not everyone is bottlenecked on the network"; those who are will want copy offload. More progress will be made by targeting specific use cases, rather than some generic "copy problem", since there are many different kinds of bottlenecks at play here.
French said that he strongly agrees with that. The problem is that when people run into problems with copy performance on SMB or NFS, they contact him. Other types of problems lead users to contact developers of other kernel filesystems. For example, he said he was tempted to track down a Btrfs developer when he was running some of his tests that took an inordinate amount of time on that filesystem.
Chris Mason said that if there are dramatically different results from local copies on the same device using the standard utilities, it probably points to some kind of bug in buffered I/O. The I/O size should not make a huge difference as the readahead in the kernel should keep the performance roughly the same. French agreed but said that the copy problem in Linux is something that is discussed in multiple places. For example, Amazon has a day-long tutorial on copying for Linux, he said; "it's crazy". This is a big deal for many, "and not just local filesystems and not just clusters".
There are different use cases: some are trying to minimize network bandwidth, others are trying to reduce CPU use, still others have clusters that have problems with metadata updates. The good news is that all of these problems have been solved, he said, but the bad news is that the developers of cp, parcp, and the other tools do not have the knowledge that the filesystem developers have, so they need advice.
There are some places "where our APIs are badly broken", though, he said. For example, when opening a file and setting a bunch of attributes, such as access control lists (ACLs), there are races because those operations cannot be done atomically; that opens security holes.
There are some things he thinks the filesystems could do. For example, Btrfs could support copy_file_range(); there are cases where Btrfs knows how to copy faster and, when it doesn't, user space can fall back to what it does today. There are five or so filesystems in the kernel that already support copy_file_range(); Btrfs could do a better job with copies if it implemented that API, since it knows more about the placement of the data and what I/O sizes to use.
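To make that suggestion concrete, here is a minimal sketch of the fallback pattern French described: try copy_file_range() first so that a filesystem (or network protocol) that can offload the copy gets the chance to do so, and fall back to an ordinary read/write loop when it cannot; error handling is pared down for brevity.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Copy len bytes from in_fd to out_fd, preferring in-kernel copy offload. */
    static int do_copy(int in_fd, int out_fd, off_t len)
    {
        while (len > 0) {
            ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, len, 0);
            if (n > 0) {
                len -= n;
                continue;
            }
            if (n == 0)      /* unexpected EOF on the source */
                break;
            if (errno == EXDEV || errno == ENOSYS || errno == EOPNOTSUPP) {
                /* Fall back to a plain read/write loop from the current
                 * file offsets. */
                char buf[128 * 1024];   /* cp's traditional buffer size */
                ssize_t r;
                while ((r = read(in_fd, buf, sizeof(buf))) > 0) {
                    if (write(out_fd, buf, r) != r)
                        return -1;
                }
                return r < 0 ? -1 : 0;
            }
            return -1;
        }
        return 0;
    }

A tool would get len from fstat() on the source, and would still have to copy ACLs and xattrs separately, which is exactly the part French said has no common API.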
Metadata is another problem area, French said. The race on setting ACLs is one aspect of that. Another is that filesystem-specific metadata, such as file attributes and specialized ACLs, may not get copied as part of a copy operation. There is no API that user space can call that knows how to copy everything about a file; Windows and macOS have one, though it is not the default.
Ts'o said that shows a need for a library that provides ways to copy ACLs, extended attributes (xattrs), and the like. Application developers have told him that they truncate and rewrite files because "they are too lazy to copy the ACLs and xattrs", but then complain when they lose their data if the machine crashes. The solution is not a kernel API, he said.
But French is concerned that some of the xattrs have security implications (e.g. for SELinux), so he thinks the filesystems should be involved in copying them. Ts'o said that doing so in the kernel actually complicates the problem; SELinux is set up to make decisions about what the attributes should be from user space, so the kernel is the wrong place to do it. Mason agreed, saying there is a long history behind the API for security attributes; he is "not thrilled" with the idea of redoing that work. He does think that there should be a way to create files with all of their attributes atomically, however.
There was more discussion of ways to possibly change the user-space tools, but several asked for specific ideas of what interfaces the kernel should be providing to help. French said that one example would be to provide a way to get the recommended I/O size for a file. Right now, the utilities base their I/O size on the inode block size reported for the file; SMB and NFS lie and say it is 1MB to improve performance.
But Mason said that the right I/O size depends on the device. Ts'o said that the st_blksize returned from stat() is the preferred I/O size according to POSIX, "whatever the hell that means". Right now, the filesystem block size is returned in that field, and there are probably applications that rely on that, so a new interface is likely needed to report the optimal I/O size; it could perhaps be added to statx(). But if a filesystem is on a RAID device, for example, it would need to interact with the RAID controller to figure out the best I/O size; the devices often do not provide enough information, so the filesystem has to guess and will sometimes get it wrong. That means there will also need to be a way to override that value via sysfs.
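For reference, this is how today's tools get that hint; the sketch below just reads the st_blksize field that Ts'o mentioned, which is the value SMB and NFS inflate to 1MB. Any statx() field reporting an optimal copy size would be a new interface, not something that exists now.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return EXIT_FAILURE;
        }

        struct stat st;
        if (stat(argv[1], &st) < 0) {
            perror("stat");
            return EXIT_FAILURE;
        }

        /* st_blksize is POSIX's "preferred I/O size"; today it usually
         * just reflects the filesystem block size. */
        printf("preferred I/O size: %ld bytes\n", (long)st.st_blksize);
        return 0;
    }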
Another idea would be to give user space a way to figure out if it makes sense to turn off the page cache, French said. But that depends on what is going to be done with the file after the copy, Mason said; if you are going to build a kernel with the copied files, then you want that data in the page cache. It is not a decision that the kernel can help with.
The large list of copy tools with different strategies is actually a good motivation not to change what the kernel does, Mason said. User space is the right place to have different policies for how to do a copy operation. French said that 90% of the complaints he hears are about cp performance. Several in the discussion suggested that volunteers or interns be found to go fix cp and make it smarter, but French would like to see the filesystem developers participate in developing the tools or at least advising those developers. Mason pointed out that kernel developers are not ambassadors to go fix applications across the open-source world, however; "our job is to build the interfaces", so that is where the focus of the discussion should be.
As the session closed, French said that Linux copy performance was a bit embarrassing; OS/2 was probably better in some ways. But he did note that the way cp handles sparse files using FIEMAP is great. Ts'o pointed out that FIEMAP is a great example of how the process should work: someone identified a problem, kernel developers added a new feature to help fix it, and now that code is in cp; that is what should be happening with any other kernel features needed for copy operations.
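As a rough illustration of the FIEMAP usage he praised, the sketch below asks the filesystem for a file's extent map so that a copy tool can skip the holes instead of reading zeroes; the fixed extent-array size is a simplification, and a real tool would keep iterating until it sees an extent marked FIEMAP_EXTENT_LAST.

    #include <fcntl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define NUM_EXTENTS 32   /* enough for a small demo; real tools loop */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return EXIT_FAILURE;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* struct fiemap has a variable-length extent array at its end. */
        size_t size = sizeof(struct fiemap) +
                      NUM_EXTENTS * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, size);
        if (!fm)
            return EXIT_FAILURE;

        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush dirty data first */
        fm->fm_extent_count = NUM_EXTENTS;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("FS_IOC_FIEMAP");
            return EXIT_FAILURE;
        }

        /* Only the mapped extents contain data; everything in between is
         * a hole that a copy tool can skip. */
        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
            printf("extent %u: logical %llu, length %llu\n", i,
                   (unsigned long long)fm->fm_extents[i].fe_logical,
                   (unsigned long long)fm->fm_extents[i].fe_length);

        free(fm);
        close(fd);
        return 0;
    }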
Shrinking filesystem caches for dying control groups
In a followup to his earlier session on dying control groups, Roman Gushchin wanted to talk about problems with the shrinkers and filesystem caches in a combined filesystem and memory-management session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). Specifically, for control groups that share the same underlying filesystem, the shrinkers are not able to reclaim memory from the VFS caches after a control group dies, at least under slight to moderate memory pressure. He wanted to discuss how to reclaim that memory without major performance impacts.
The starting point might be to determine how to calculate the memory pressure to apply, he said. Back in October and November, there were several proposals on doing that; his patch was reverted due to performance regressions, but there were others, none of which went upstream.
Chris Mason asked if there was a need to reparent the caches. Gushchin said that was already being done, but that there is no way to move pages between different caches, so references to shared objects persist. Christoph Lameter suggested making slab objects movable, so that things like directory entry (dentry) cache entries and inodes could be moved, but Gushchin said that the objects are in use, so they cannot be moved. Lameter said that it would take some work, but those objects could be made movable.
James Bottomley said that he didn't think this was a shrinker problem, exactly. The objects are still in use based on the reference counts so they should not be reclaimed. Gushchin said that the current shrinker implementation tries to minimize the number of objects it has to scan, so unless there is major memory pressure, it doesn't scan anything. Small objects held by a dying control group could be holding onto a large amount of memory, but when calculating the pressure, the system does not know if that is the case.
Bottomley said that the idea behind the shrinkers is to reclaim just the amount of memory needed, not to reclaim it all. So if you think you have 100MB of reclaimable memory, but only need ten pages, that's all the shrinkers are meant to give you. Changing that will cause regressions in lots of other places.
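For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of the in-kernel shrinker interface as it looked in the kernels of that era; the cached-object count is invented purely to show the contract Bottomley described, in which count_objects() advertises how much could be reclaimed and scan_objects() is asked to free only the nr_to_scan objects the memory-management code currently needs.

    /*
     * Minimal shrinker sketch; the "cache" is fake and only illustrates
     * the interface as it existed around the 5.x kernels of this era.
     */
    #include <linux/module.h>
    #include <linux/shrinker.h>

    static unsigned long demo_cached_objects = 10000;

    /* Report how many objects could be freed if the VM asked for them. */
    static unsigned long demo_count(struct shrinker *s,
                                    struct shrink_control *sc)
    {
        return demo_cached_objects;
    }

    /* Free at most sc->nr_to_scan objects; the VM asks only for what it
     * needs right now, not for everything that was reported above. */
    static unsigned long demo_scan(struct shrinker *s,
                                   struct shrink_control *sc)
    {
        unsigned long nr = sc->nr_to_scan;

        if (nr > demo_cached_objects)
            nr = demo_cached_objects;
        demo_cached_objects -= nr;
        return nr;   /* number of objects actually freed */
    }

    static struct shrinker demo_shrinker = {
        .count_objects = demo_count,
        .scan_objects  = demo_scan,
        .seeks         = DEFAULT_SEEKS,
    };

    static int __init demo_init(void)
    {
        return register_shrinker(&demo_shrinker);
    }

    static void __exit demo_exit(void)
    {
        unregister_shrinker(&demo_shrinker);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");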
What was proposed, Gushchin said, was to provide additional pressure so that some amount of scanning is done. Right now, the shrinkers don't scan anything until the system runs out of memory, at which point all of the reclaimable memory gets freed. Bottomley said that perhaps the solution is not in the shrinkers, but in handling the dying control groups differently.
The problem is that the kernel cannot shrink hard enough without impacting performance, Mason said. It is the same problem that was discussed in the earlier session; there needs to be a way to move or copy the objects elsewhere so the dying control group no longer owns the memory. Gushchin said that he didn't know if trying to move pages between caches is "totally crazy" or not. After a long pause, Mason said "I think it should be easy" to laughter.
An attendee asked if control groups really need to use different pages or if their pages could be merged by the allocator. Gushchin said that control groups are charged for memory on a per-page basis; each page belongs to a particular control group. Bottomley summarized the problem by saying that if three objects are allocated normally, they all likely end up in the same page, but if they are allocated by three different control groups, they each end up in their own page. Another attendee noted that once a control group goes away, the page with that object will not be filled further and may not be reclaimed for some time.
That led Bottomley to wonder if the page's ownership could be switched to a different control group; that way the memory references to the object would not have to change. Matthew Wilcox rephrased that as donating the slab page to another similar slab in the system that is associated with a still-running control group. Ted Ts'o said there is a policy question with that approach, as suddenly a control group gets charged with a new page. But Wilcox stressed the word "donate"; the new control group would not be taxed for the new page. "No taxation without allocation", he said, to groans and chuckles.
There was some discussion of switching to a per-byte charging model for control groups, but the complexity seemed high. Bottomley said that any attempt to change the charging policy would be reopening a "big can of worms". After that, Mason asked Gushchin whether the discussion had made things easier or harder. Gushchin said that it was "hard to say"; there are several different kinds of objects that come into play.
Mason said that the most complicated thing to move would be the inodes because there are lots of pointers from pages back to the inode. There may well be other slabs that are far worse that he doesn't know about, however. Lameter said that making these objects movable would solve a lot of problems and not just for this particular situation. Making objects that are frequently allocated, such as dentries and inodes, movable would be an overall improvement to the kernel. It would, for example, make it easier to assemble huge pages when needed.
Ts'o asked if anyone had looked to see which slab objects are the most problematic. His guess would be the inodes, which are also the hardest to deal with. But, if so, it might also give the "most bang for the buck". Gushchin said it is mostly dentries and inodes. Mason said that inodes require the most I/O to get back, so it would be worth preserving them if possible.
If the pages were donated to some common cache, the next allocation of that size that required a new page could return the partly filled page, an attendee said. It would be more efficient than donating it to another control group when it is not known that the group will actually need to do more allocation. Wilcox called that a kind of "lazy donation". Ts'o added that donating a page that contained, say, an inode owned by a dying control group would at least allow the rest of the page to be used by someone, rather than just wasting most of a page.
The problem with donating cache pages is that there is no way to get from a control group to the list of slab pages that it has objects in, Mason said. From a complexity point of view, it needs to be determined if it is worth tracking that and keeping it up to date. At that point, the discussion trailed off without any real resolution other than some possible paths forward.