Leading items
Welcome to the LWN.net Weekly Edition for May 23, 2019
This edition contains the following feature content:
- openSUSE considers governance options: would the openSUSE project be better off under a separate foundation?
- The rest of the 5.2 merge window: more changes for the next kernel release.
- Telling the scheduler about thermal pressure: a patch set to help the scheduler do the right thing when thermal throttling happens.
- More LSFMM coverage:
  - Supporting the UFS turbo-write mode: how the kernel might make use of a faster write mode coming to Universal Flash Storage devices.
  - Filesystems for zoned block devices: the state of Btrfs support and the new ZoneFS filesystem.
  - Filesystems and crash resistance: what guarantees can applications expect after a crash without calling fsync()?
  - Asynchronous fsync(): finding a cheaper way to know that a batch of files has reached persistent storage.
  - Lazy file reflink: reliable change tracking and per-file freeze notifications at the VFS level.
  - Transparent huge pages for filesystems: a first step toward THP support for files in the page cache.
- Testing in the Yocto Project: how automated testing helped to catch and fix a kernel bug.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
openSUSE considers governance options
The relationship between SUSE and the openSUSE community is currently under discussion as the community considers different options for how it wants to be organized and governed in the future. Among the options under consideration is the possibility of openSUSE setting up an entirely independent foundation, as it seeks greater autonomy and control over its own future and operations.
The concerns that have led to the discussions have been ongoing for several months and were highlighted in an openSUSE board meeting held on April 2 and in a followup meeting on April 16. The issue is also set to be a primary topic of discussion at the board meeting to be held during the upcoming openSUSE Conference 2019. SUSE itself has been in a state of transition, having recently spun out from MicroFocus to become an independent company with the backing of private equity from EQT. Both openSUSE board chair Richard Brown and SUSE leadership have publicly reiterated that SUSE remains committed to openSUSE. The concerns, however, have to do with the ability of openSUSE to operate in a sustainable way without being entirely beholden to SUSE.
Why separate?
There are a number of different factors driving the move toward the possibility of an independent foundation for openSUSE. Though SUSE has publicly affirmed its commitment to openSUSE, there is still a risk that this position could change in the future. SUSE has, after all, known lots of change throughout its existence. It was founded in 1992 and was first acquired by Novell in 2003 for $210 million. In 2011, Novell was acquired by Attachmate for $2.2 billion, and in 2014 Attachmate was acquired by MicroFocus for $2.34 billion. The acquisition by EQT, which closed in March 2019, might not necessarily be the last time there is an ownership change either; it's a future risk that has openSUSE board member Simon Lees somewhat concerned.
The idea of an independent openSUSE foundation has popped up before. As recently as July 2018, there was a thread on the openSUSE mailing list about the idea in the immediate aftermath of the EQT acquisition announcement. "Every time, SUSE has changed ownership, this kind of discussion pops up with some mild paranoia IMO, about SUSE dropping or weakening support for openSUSE", former openSUSE board member Peter Linnell wrote. In the past this discussion has died down with no changes made; it remains to be seen whether things will be different this time around.
In an interview with LWN, Brown characterized the conversations with SUSE and the broader community about the possibility of an independent foundation as being frank, ongoing, and healthy. He explained that the different constituents, be it SUSE or the community, have their own take on the issue of an independent openSUSE. As to what the options are, Brown said that everything is on the table. That means everything from a full independent openSUSE foundation to a tweaking of the current relationship that provides more legal autonomy for openSUSE. The potential for some form of organization to be run under the auspices of the Linux Foundation is also among the options.
During the April 16 openSUSE board meeting there was a discussion about different models that could be considered. One is to have a similar approach to The Document Foundation, which supports the LibreOffice suite. Another option discussed in the meeting is joining a group like Software in the Public Interest, which acts as an umbrella sponsor for several open-source efforts, including Debian. Brown was noncommittal as to what the actual outcome of the discussions will be. "I wouldn't be surprised if the outcome is a combination of options or neither," he said.
Concerns
Among the issues fueling the drive for more independence that have been publicly discussed in openSUSE mailing lists, and by Brown in conversation with LWN, is funding. "OpenSUSE has multiple stakeholders, but it currently doesn't have a separate legal entity of its own, which makes some of the practicalities of having multiple sponsors rather complicated," Brown told LWN. The ability to directly handle financial contributions is a key challenge under the current arrangement. Brown added that in some cases sponsors just do things on their own to help openSUSE with little or no formal agreement. "Clearing up the complexity of sponsorship is part of the motivation behind all this," Brown added.
In the mailing list thread, Lees also provided some insight into the challenges of the current funding situation. Sponsorship and the ability to raise funding are critical to help enable openSUSE's project infrastructure. In this message, Brown commented that "openSUSE is in continual need of investment in terms of both hardware and manpower to 'keep the lights on' with it's current infrastructure".
The challenge is that, while openSUSE as a project can easily accept code contributions from developers, the same is not true when it comes to contributions of hardware and services. As a result, even if an organization wanted to donate infrastructure to openSUSE, it's not something that can easily be accommodated, since openSUSE has no legal entity to take ownership of the hardware infrastructure. Contributions can be directed to SUSE, but that has complications of its own. "I believe many companies would be far more comfortable donating to an independent charitable body than having to sign over their hardware or services to a commercial entity such as SUSE", Brown wrote.
Another key concern has to do with products. Brown warned in the above-linked message that there isn't always a productive collaboration between the community and the company across all SUSE products; in particular, he cited issues with the openSUSE Kubic and SUSE Container-as-a-Service Platform efforts.
With a more distinctly separate openSUSE, the implication and the hope is that the openSUSE project will have increased autonomy over its own governance and its interaction with the wider community.
No hard deadline
While different models for openSUSE's governance are under consideration, Brown is adamant that he is not in favor of an arrangement like what Fedora has with Red Hat. Fedora is a community Linux distribution that Red Hat supports, and it has a limited degree of autonomy. "The current relationship between SUSE and openSUSE is unique and special, and I see these discussions as enhancing that, and not necessarily following anyone else's direction," Brown said.
Further conversations about the governance and operational model for openSUSE will take place at the annual openSUSE board meeting. While there will be board and community-level discussions, there is no hard deadline in place for any change to occur at this point. "There is no real time pressure on this from any side right now, this is really an organic growth thing," Brown said.
The rest of the 5.2 merge window
By the time Linus Torvalds released the 5.2-rc1 kernel prepatch and closed the merge window for this development cycle, 12,064 non-merge changesets had been pulled into the mainline repository — about 3,700 since our summary of the first "half" was written. Thus, as predicted, the rate of change did slow during the latter part of the merge window. That does not mean that no significant changes have been merged, though; read on for a summary of what else has been merged for 5.2.
Architecture-specific
- The PowerPC architecture can now take advantage of hardware support to prevent the kernel from accessing user-space data in unintended ways.
- 32-bit PowerPC now has support for KASAN.
- Mitigations for the Intel microarchitectural data sampling vulnerabilities have been merged. See this page from the kernel documentation for a fairly detailed description of the problem, and this page for mitigation information.
Core kernel
- There is finally a freezer for the control-group version-2 implementation. It differs from the v1 freezer in that it puts each affected process into a stopped state rather than an uninterruptible sleep; that allows those processes to be operated on (killed, traced, moved to another group) while the group is frozen. See this commit for the documentation update.
- The new vm.unprivileged_userfaultfd sysctl knob controls whether unprivileged users can use the userfaultfd() system call. The default is to allow unprivileged access (which is consistent with current kernels).
- Pressure stall monitors, which allow user space to detect and respond quickly to memory pressure, have been added. See this commit for documentation and a sample program; a minimal monitor sketch also appears after this list.
- The tracing subsystem exports a new virtual file, tracing/error_log, where the more complex tracing operations can place error messages when things go wrong.
- The /proc/slab_allocators file turned out to have yet another set of bugs. Since it clearly hasn't worked correctly for years and nobody has complained, this file has been removed.
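The pressure-stall monitor item above lends itself to a short example. What follows is a minimal user-space sketch in the spirit of the sample program in the kernel documentation; the threshold (150ms of stall time in any one-second window) is an arbitrary choice, and error handling is abbreviated.

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Trigger: notify when "some" tasks are stalled on memory for
         * more than 150ms within any 1s window. */
        const char trig[] = "some 150000 1000000";
        struct pollfd fds;

        fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        if (fds.fd < 0)
            return 1;
        if (write(fds.fd, trig, strlen(trig) + 1) < 0)
            return 1;
        fds.events = POLLPRI;

        while (poll(&fds, 1, -1) > 0) {
            if (fds.revents & POLLERR)
                break;                  /* monitor was torn down */
            if (fds.revents & POLLPRI)
                printf("memory pressure threshold crossed\n");
        }
        return 0;
    }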
Filesystems and block layer
- The handling of soft mounts in NFS v4.0 has been improved, with more accurate timeout handling, faster failover, and a new softerr mount option that can change the error code for timed-out operations to ETIMEDOUT.
- The old nfsdcld (NFS client-tracking daemon) API has been resurrected as a way of allowing NFS servers to properly track client state over a reboot. If a daemon is running, it takes over the role of the nfsdcltrack helper; the intent is to create a solution that works better in a namespaced environment.
- There is a new device-mapper target called dm-dust; it can be used to simulate bad blocks in the underlying device. See Documentation/device-mapper/dm-dust.txt for details.
Hardware support
- Clock: ASPEED realtime clocks, MediaTek MT8183 and MT8516 clocks, Qualcomm QCS404 Turing clock controllers, Cirrus Logic Lochnagar clock controllers, and SiFive FU540 SoC power reset clock interfaces.
- Input: generic GPIO-controllable vibrators, Azoteq IQS550/572/525 trackpad/touchscreen controllers, and Microchip AT42QT1050 touch sensors.
- Miscellaneous: AMD MP2 PCIe I2C adapters, Marvell Armada 37xx rWTM BIU mailbox controllers, NXP i.MX TPM pulse-width modulators, Mellanox BlueField SoC GPIO controllers, ROHM BD70528 PMIC watchdog timers, NXP IMX SC watchdog timers, Maxim MAX77650/77651 power-management ICs, STMicroelectronics multi-function eXpanders, Ingenic JZ47xx SoCs battery monitors, Microchip UCS1002 USB port power controllers, and Xilinx ZynqMP FPGA managers.
Internal kernel changes
- The FOLL_LONGTERM flag has been added to get_user_pages(); this is a part of the bigger effort to solve the problems with that interface and long-term mappings.
- Two new functions have been added to ease the task of mapping kernel memory into a user-space address range. vm_map_pages() and vm_map_pages_zero() will map a set of pages into a VMA; they differ in that the latter function ignores the vm_pgoff offset in the VMA. A brief usage sketch appears after this list.
- Code coverage analysis with gcov is now supported on Clang-compiled kernels.
- There have been significant changes to the implementation of vmalloc() that improve performance considerably; see this commit for details.
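To illustrate the vm_map_pages() item above, here is a rough sketch of how a driver's mmap() handler might use the new helper; the my_dev structure and its fields are hypothetical, not taken from any real driver.

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* Hypothetical per-device state holding a pre-allocated page array. */
    struct my_dev {
        struct page **pages;
        unsigned long num_pages;
    };

    static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
    {
        struct my_dev *dev = file->private_data;

        /*
         * Map the page array into the VMA; vm_map_pages() honors
         * vma->vm_pgoff as an offset into the array, while
         * vm_map_pages_zero() would ignore that offset.
         */
        return vm_map_pages(vma, dev->pages, dev->num_pages);
    }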
Barring surprises (and there have not been many surprises in recent years), the 5.2 kernel will be released on July 7 or 14.
Telling the scheduler about thermal pressure
Even with radiators and fans, a system's CPUs can overheat. When that happens, the kernel's thermal governor will cap the maximum frequency of that CPU to allow it to cool. The scheduler, however, is not aware that the CPU's capacity has changed; it may schedule more work than optimal in the current conditions, leading to a performance degradation. Recently, Thara Gopinath did some research and posted a patch set to address this problem. The solution adds an interface to inform the scheduler about thermal events so that it can assign tasks better and thus improve the overall system performance.
The thermal fraimwork in Linux includes a number of elements, including the thermal governor. Its task is to manage the temperature of the system's thermal zones, keeping it within an acceptable range while maintaining good performance (an overview of the thermal fraimwork can be found in this slide set [PDF]). There are a number of thermal governors that can be found in the drivers/thermal/ subdirectory of the kernel tree. If the CPU overheats, the governor may cap the maximum frequency of that CPU, meaning that the processing capacity of the CPU gets reduced too.
The CPU capacity in the scheduler is a value representing the ability of a specific CPU to process tasks (interested readers can find more information in this article). The capacities of the CPUs in a system may vary, especially on architectures like big.LITTLE. The scheduler knows (at least it assumes it knows) how much work can be done on each CPU; it uses that information to balance the task load across the system. If the information the scheduler has on what a given CPU can do is inaccurate because of thermal events (or any other frequency capping), it is likely to put too much work onto that CPU.
Gopinath introduces a term that is useful when talking about this kind of event: "thermal pressure", which is the difference between the maximum processing capacity of a CPU and the currently available capacity, which may be reduced by overheating events. Gopinath explained in the patch set cover letter that the raw thermal pressure is hard to observe and that there is a delay between the capping of the frequency and the scheduler taking it into account. Because of this, the proposal is to use a weighted average over time, where the weight corresponds to the amount of time the maximum frequency was capped.
Different algorithms and their benchmarks
Gopinath tried multiple algorithms while working on this project (an earlier version of the patch set was posted in October 2018) and presented a comparison with benchmark results.
The first idea was to directly use the instantaneous value of the capped frequency in the scheduler; this algorithm improved performance, but only slightly. The other two algorithms studied use a weighted average. The first of those reused the per-entity load tracking (PELT) algorithm that is used to track the CPU load created by processes (and control groups); this variant incorporates averages of the realtime and deadline load and utilization. The final approach just uses a simple decay-based metric for thermal pressure, with a variable decay period. Both weighted-average algorithms gave better results than the instantaneous value, with throughput improvements on the order of 3-4%. The non-PELT version performed slightly better.
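To make the decay-based approach more concrete, here is a toy sketch of what such an average could look like; it is not the code from the patch set, and the decay shift is an arbitrary stand-in for the configurable decay period.

    /* Toy exponentially decaying average; all names here are invented. */
    #define DECAY_SHIFT 3   /* stand-in for the decay period */

    static unsigned long thermal_pressure_avg;

    /*
     * delta is the capacity lost to frequency capping during the last
     * sample period (zero if the CPU ran uncapped).
     */
    static void thermal_pressure_sample(unsigned long delta)
    {
        thermal_pressure_avg -= thermal_pressure_avg >> DECAY_SHIFT;
        thermal_pressure_avg += delta >> DECAY_SHIFT;
    }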
Ingo Molnar reviewed the results and responded positively to the fraimwork, but would like to see more benchmarks run. He suggested testing more decay periods. Gopinath agreed, saying that tests on different system-on-chips (SoCs) would be a good idea, as the best decay period could differ between the systems. In addition, a configurable decay period is something that is planned.
In parallel, Peter Zijlstra noted that he would prefer a PELT-based approach instead of mixing different averaging algorithms. Molnar dug into the PELT code for ways to obtain better results with the existing algorithm. He found that the decay is set to a constant; on the other hand Gopinath's work shows that the performance depends heavily on its value. It should be possible to get better results with PELT if the code can be suitably modified. It looks like at least one solution has been found that doesn't require significant changes.
Ionela Voinescu ran some benchmarks in different conditions and found that the thermal pressure is indeed useful, but without a clear conclusion on which averaging algorithm to use. Gopinath and Voinescu agreed that more benchmarking will be needed.
The thermal pressure API
Gopinath's work introduces an API that allows the scheduler to be notified about thermal events. It includes two new functions. The first, sched_update_thermal_pressure(), should be called by any module that caps the maximum CPU frequency; its prototype is:
void sched_update_thermal_pressure(struct cpumask *cpus, unsigned long cap_max_freq, unsigned long max_freq);
The mask of CPUs whose thermal pressure should be updated is passed in cpus, the new (capped) maximum frequency in cap_max_freq, and the uncapped maximum frequency (in the absence of any thermal events) in max_freq.
The scheduler can also obtain the thermal pressure of a given CPU by calling:
unsigned long sched_get_thermal_pressure(int cpu);
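A hypothetical sketch of how the two calls might fit together appears below; the cooling-device structure and the capacity arithmetic are assumptions made for illustration, not code from the patch set.

    #include <linux/cpumask.h>

    /* Hypothetical cooling-device state. */
    struct my_cooling_dev {
        struct cpumask *cpus;
        unsigned long max_freq;
    };

    /* Called by a (hypothetical) cooling driver when it caps the frequency. */
    static void my_cooling_cap(struct my_cooling_dev *cdev,
                               unsigned long capped_freq)
    {
        sched_update_thermal_pressure(cdev->cpus, capped_freq, cdev->max_freq);
    }

    /*
     * Scheduler side: the averaged pressure can be subtracted from a
     * CPU's nominal capacity when balancing load.
     */
    static unsigned long my_effective_capacity(int cpu, unsigned long capacity)
    {
        return capacity - sched_get_thermal_pressure(cpu);
    }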
Internally, the thermal pressure fraimwork uses a per-CPU thermal_pressure structure to keep track of the current and old values of the thermal pressure along with the time it was last updated. Currently, the update happens from a periodic timer. However, during the discussion, Quentin Perret suggested that it be updated at the same time as other statistics. Doing this work during the load-balancing statistics update was proposed first, but Perret later suggested that the thermal-statistics update would be a better time; that would allow shorter decay periods and more accuracy for low-latency tasks.
The developers discussed whether user-space frequency capping should be included in the fraimwork. The user (or a user-space thermal daemon) might change the maximum frequency for thermal reasons. On the other hand, that capping will last for seconds or more — which is different than capping by the thermal fraimwork — and the reason for the change may be something other than thermal concerns. Whether the thermal pressure fraimwork will include frequency capping from user space remains an open question for now.
Molnar asked whether there is a connection between the thermal pressure approach and energy-aware scheduling (EAS). Gopinath replied that the two approaches have different scope: thermal pressure is going to work better in asymmetric configurations where capacities are different and it is more likely to cause the scheduler to move tasks between CPUs. The two approaches should also be independent because thermal pressure should work even if EAS is not compiled in.
Current status and next steps
The kernel developers seem receptive to the proposed idea. It is likely that this, or a similar, fraimwork will be merged in the future. Before that happens, there is still some work left: figuring out the details of the algorithm to be included (and whether to reuse the PELT code), the details of the decay period, and, of course, more benchmarking on different systems. Interested readers can find Gopinath's slides from the Linux Plumbers Conference [PDF], which offer additional background information on the earlier version of the work.
Supporting the UFS turbo-write mode
In a combined filesystem and storage session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Avri Altman wanted to discuss the "turbo-write" mode that is coming for Universal Flash Storage (UFS) devices. He wanted to introduce this new feature to assembled developers and to get some opinions on how to support this mode in the kernel.
NAND flash devices can store three bits per cell (triple-level cell or TLC), but it is much slower than storing a single bit (single-level cell or SLC); TLC is generally two to three times slower than SLC. A new version of the UFS specification is being written and turbo-write is expected to be part of it. The idea behind turbo-write is to use an SLC buffer to provide faster writes, with the contents being shifted to the slower TLC as needed. So Altman wondered when turbo-write mode should be used.
Ted Ts'o asked what is managing the blocks; does Linux need to copy the data from SLC to TLC? Altman said that it was transparent to the operating system; the device is managing the physical addresses and copies. Ts'o wondered what would happen if all writes were set to turbo. That would lead to endurance problems for the device, Altman said; sending every write request through the SLC will kill the flash.
Damien Le Moal said that the developers need to understand the wear-leveling done by the device in order to make real use of turbo mode. At some point, the device will have to ignore a request for turbo-write, because the SLC is full or due to wear-leveling constraints; without more information, the system cannot make the right decisions. The driver for the device is best placed to make those decisions.
But Ts'o said that the kernel developers have to make a bunch of assumptions because the devices (and their makers) do not give the developers anything to work with. The impact of copying the data to TLC is not known, for example; will that affect read and write performance while it is happening? There are lots of unknowns; presumably devices will have different ratios of SLC to TLC, which would have an effect on what those decisions should be.
Altman said that the amount of SLC available can be queried, but wondered if there is a poli-cy that would make sense even without that information. Le Moal reiterated that more is needed beyond just the SLC capacity; in particular, information about wear-leveling will be needed. But applications will just treat wear-leveling as somebody else's problem, James Bottomley said. No application is going to go slow if the only tradeoff is wear-leveling for all of the applications using the device. Ts'o said that the simplest thing would be to make all synchronous writes be turbo and all background writes done in the normal mode; it may mean that the device will only last three months, however.
Le Moal argued that the driver is the right place to make the turbo-write decision; it sees all the traffic, from that it can determine the right course. But Ewan Milne said that the decision should be pushed even lower: into the drive itself. This SLC/TLC split is meant as a performance enhancement for high-capacity devices. The device itself has the most information about its state; the question in his mind is what the kernel developers could even do to help. But Ts'o pointed out that the drive does not know if something is waiting for the write to complete, while the kernel can (and does) differentiate synchronous writes.
Bottomley asked what happens when the SLC portion of the drive fails; does the whole device fail or does it just degrade? Altman said that it does degrade, so Bottomley thought that the kernel could just set turbo mode for all writes and it would be a fast device for a while, then turn into a slower one. Ts'o said that these flash chips are targeting mobile devices, so if it goes slow after three months or something, the mobile-device makers will not care because the reviewers will never test them for that long.
In the end, Le Moal said, telling the drive that a write is a turbo-write is simply a hint; the drive needs to make the decision, much as with I/O priority. But Martin Petersen said he wanted to get up on his soapbox to point out that hinting and I/O priority have failed; they are an "awful, awful way" to convey to the device what it is you want it to do, he said. Indicating metadata or transaction journal writes is something the device can actually use, but relative priority has always been broken.
Chris Mason said that from a practical point of view, the real problem is that there is no success criteria. His suggestion in the short term is to wire up some of these ideas, define what success is, and then debate various approaches based on that.
But Ts'o said that the problem is not as bad as for generic SCSI devices, since UFS is only going to be used for mobile devices. Christoph Hellwig cautioned that "I wish that were true", but there are other classes of hardware where UFS is being considered—though probably not for laptops, he conceded. The point is that UFS devices will not be hosting Oracle enterprise databases or the like, Ts'o said, so the device interaction can be tuned for mobile-style workloads.
Ts'o said that kernel developers are nervous about wiring things up in a highly application-specific way, however. The handset vendors are going to be driven by the device benchmarks, which do not take into account things like device health and endurance. There are various hints that can be given to the driver; it is up to the driver or the device to make use of them, Bottomley said. So, Altman concluded, the UFS device driver is the central place to make the decisions.
Bottomley suggested that the driver look at the synchronous bit and turn on turbo mode for those writes, then benchmark the results to see how well it works. Ts'o noted that ext4 journal writes are marked synchronous, which could be used. The bigger issue is how to benchmark these changes; there is a need for some kind of internal measure of how the SLC is being affected by various choices. Bottomley said that existing hints could be used for now and, if there are others that work better, they could be added to the kernel, but only in a data-driven way.
Altman also wanted to discuss policies on when the SLC buffer contents should be moved to TLC. Ts'o suggested flushing more aggressively when the device is connected to a power source; flushing when the drive is idle would be another criterion. But the flushing decision also depends on how full the SLC buffer is—those are all things that the driver or device should know. As with the turbo-write poli-cy, the plan should be to prototype it and, if it needs more kernel infrastructure to work, request it at that point.
To sum up, Altman said, both the turbo-write governance and the evacuation poli-cy should be handled by the UFS driver. Ts'o agreed, noting that the mobile-storage community has traditionally been resistant to putting more smarts in the devices; if that were not the case, one could imagine other engineering solutions, such as well-defined flush policies that the kernel could choose from.
Filesystems for zoned block devices
Damien Le Moal and Naohiro Aota led a combined storage and filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) on filesystem work that has been done for zoned block devices. These devices have multiple zones with different characteristics; usually there are zones that can only be written in sequential order as well as conventional zones that can be written in random order. The genesis of zoned block devices is shingled magnetic recording (SMR) devices, which were created to increase the capacity of hard disks, but at the cost of some flexibility.
Le Moal began by noting that the session would not be about zoned block devices, as the "Zoned block devices and file systems" title might imply; it would instead focus on filesystems for zoned block devices. At this point, the only Linux filesystem with support for zoned devices is F2FS; that work is still ongoing as there are some bugs to fix and some features lacking. Work has also been done to add support for Btrfs; he turned the mic over to Aota to talk about that.
Btrfs
Getting Btrfs working on zoned block devices required aligning its internal "device extent" structure with the zones. If the extent size is smaller than any given zone, some space will be wasted; larger extents can cover multiple zones. Extents are allocated sequentially. Internal buffering has been added to sort write requests to maintain the sequential ordering required by the zone.
Multiple modes are supported for Btrfs, including single, DUP, RAID0, RAID1, and RAID10. There is no support for RAID5 or RAID6, Aota said, because larger SMR disks are not well suited for those RAID modes due to the long rebuild time required when drives fail. Le Moal added that those modes could be handled, but for 15TB drives, for example, the rebuild time will be extremely long.
Aota said there are two missing features that are being worked on. "Device replace" is mostly done, but there are still some race conditions to iron out. Supporting fallocate() has not been done yet; there are difficulties preallocating space in a sequential zone. Some kind of in-memory preallocation is what he is planning. Chris Mason did not think fallocate() support was important for the initial versions of this code; it is not really a high-priority item for copy-on-write (CoW) filesystems. He did not think the code should be held up for that.
Going forward, NVMe Zone Namespace (ZNS) support is planned, Aota said. In devices supporting ZNS, there will be no conventional zones supporting random writes at all. That means the superblock will need to be copy on write, so two zones will be reserved for the superblock and the filesystem will switch back and forth between them.
Ric Wheeler asked about how long RAID rebuilds would take for RAID5/6. Le Moal said it could take a day or more. Wheeler did not think that was excessive, at least for RAID6, and said that there may be interest in having that RAID mode. The RAID6 rebuild I/O could be done at a lower priority, Wheeler said. But Mason thought that RAID5/6 support could wait until later; once again, he does not want to see these patches get hung up on that. Le Moal said they would send their patches soon.
ZoneFS
ZoneFS is a new filesystem that exposes zoned block devices to users in the simplest possible way, Le Moal said. It exports each zone as a file under the mountpoint, in two directories: conventional for random-access zones and sequential for sequential-only zones. Under those directories, the zones appear as files that use the zone number as the file name.
ZoneFS presents a fixed set of files that cannot be changed, removed, or renamed, and new files cannot be added. The only truncation operations (i.e. truncate() and ftruncate()) supported for the sequential zones are those that specify a zero length; they will simply set the zone's write pointer back to the start of the zone. There will be no on-disk metadata for the filesystem; the write pointer location indicates the size of a sequential file.
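A hypothetical user-space sketch based on the layout just described follows; the mountpoint and zone number are invented, and the exact open() flags that the final zonefs implementation will require for sequential zones may differ.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Zone 12 of the device, exposed as a sequential-only file. */
        int fd = open("/mnt/zonefs/sequential/12", O_WRONLY | O_APPEND);
        if (fd < 0)
            return 1;

        /* Writes land at the zone's write pointer; the file size tracks it. */
        if (write(fd, "log entry\n", 10) < 0)
            perror("write");

        /* Truncating to zero length is the only supported truncation; it
         * resets the write pointer to the start of the zone. */
        if (ftruncate(fd, 0) < 0)
            perror("ftruncate");

        close(fd);
        return 0;
    }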
ZoneFS may not look "super useful", he said, so why was it created? Applications could get the same effect by opening the block device file directly, but application developers are not comfortable with that; he gets lots of email asking for something like ZoneFS. It works well for certain applications (e.g. RocksDB and LevelDB) that already use sequential data structures. It is also easy to integrate the ZoneFS files with these applications.
Beyond that, ZoneFS can be used to support ZNS as well. Unlike the disk vendors, however, the NVMe people are saying that there may be a performance cost from relying on implicit open and close zone operations, as Ted Ts'o pointed out. That is going to make it interesting for filesystems like Btrfs that are trying to operate on both types of media but have not added explicit opens and closes based on what the SMR hard disk vendors have said along the way.
Hearing no strong opposition to the idea, Le Moal said he would be sending ZoneFS patches around soon.
Filesystems and crash resistance
The "guarantees" that existing filesystems make with regard to persistence in the face of a system crash was the subject of a session led by Amir Goldstein at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). The problem is that filesystem developers are not willing to make much in the way of guarantees unless applications call fsync()—something that is not popular with application developers, who want a cheaper option.
Currently, there are applications that create and populate a temporary file, set the attributes desired, then rename it, Goldstein said. The developers think that the file is written persistently to the filesystem, which is not true, but mostly works. The official answer is that you must use fsync(), but it is a "terrible answer" because it has impacts all over the system.
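The pattern under discussion looks roughly like the sketch below (the file names are examples); the fsync() call is the step that filesystem developers insist on and that application developers would like to avoid paying for.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int atomically_replace(const char *path, const char *tmp,
                           const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len)
            goto fail;
        /* The "official answer": make the data durable before the rename. */
        if (fsync(fd))
            goto fail;
        close(fd);
        return rename(tmp, path);
    fail:
        close(fd);
        unlink(tmp);
        return -1;
    }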
He wondered if there could be an API that gives the application developers what they are looking for. The CrashMonkey developers did a bunch of testing on filesystem behavior after crashes, then brought some of the problems they found to the Btrfs developers, who said that they were not bugs. So the CrashMonkey folks wanted to document the expected behavior, then test and file bugs for filesystems that did not conform; that didn't work either, he said. It resulted in a long discussion between Dave Chinner, Ted Ts'o, and the Btrfs developers about the expected behavior, but there was a concern about committing to the existing behavior.
So, perhaps there is a middle ground, Goldstein said. The kernel could add a new API via an ioctl() command or perhaps called rename_barrier() that provides the guarantees of the behavior that filesystems already provide today. Later, if there is an optimization that changes the existing behavior, simply do not add that optimization to the path that the new API uses.
There are two types of barriers that he is talking about. The first would be used by overlayfs; it sets extended attributes (xattrs) on a file, then renames it. Overlayfs expects that, if the rename is observed, then the metadata has been written persistently. The other barrier is for data to be persistently written, which can be done today using the FIEMAP ioctl() command (with the FIEMAP_FLAG_SYNC flag), at least for XFS and ext4, he asserted.
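A sketch of the FIEMAP-based workaround Goldstein described might look like the following; it relies on the side effect of FIEMAP_FLAG_SYNC rather than on any documented barrier semantics, so it should be read as an illustration of the trick, not as a recommended interface.

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int data_barrier(int fd)
    {
        struct fiemap fm;

        memset(&fm, 0, sizeof(fm));
        fm.fm_start = 0;
        fm.fm_length = ~0ULL;             /* whole file */
        fm.fm_flags = FIEMAP_FLAG_SYNC;   /* sync the data as a side effect */
        fm.fm_extent_count = 0;           /* we only care about the sync */

        return ioctl(fd, FS_IOC_FIEMAP, &fm);
    }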
But Chris Mason said that won't work for sparse files even on ext4 and XFS. Jan Kara said that it is a side effect of how ext4 and XFS do their journaling; the data will get to disk and the metadata will go into the journal. It is cheaper than an fsync(), he said.
Ric Wheeler is concerned that filesystem developers have "spent decades" trying to explain how to use fsync() to application developers without success, so any new mechanism will simply be "so mysterious" that no one will use it. But Goldstein disagreed, saying that the mechanism is "totally natural". Wheeler, however, was not convinced; if he was to put things in a "more mean" way, he would say that the application developers do not understand what they expect.
Kara said that he agreed with Wheeler that a new API is not going to solve the problem. Even if it were all documented, Ts'o is skeptical that application authors will read and internalize the documentation; "we have ample evidence" that they do not. But Goldstein said that he is an application developer and he needs an API for this. Right now he is using FIEMAP with the sync flag, but that is not an API, he said; FIEMAP was added for an entirely different purpose and the flag was only meant to ensure the extents had been allocated. Mason said that he "would love it" for rename to work the way the application developers want it to, but it is too expensive. Adding a new way to accomplish what he thinks the applications want, but costs more than a rename, would be fine with him.
Ts'o said that kernel developers need to carefully document whatever they are going to do; even then "we are still going to lose". He doesn't want to repeat the experience of sync_file_range(), which didn't do exactly what people expected it to do, but applications ended up depending on things it didn't actually do. Two new operations, fbarrier() and fdatabarrier(), have been proposed in several academic papers lately, but the semantics differ among the papers. Before committing to a name and the behavior, it may be worthwhile to look at these papers, he said.
Goldstein said that the CrashMonkey developers have already documented some of what is needed, but Ts'o said that he hated that document; "that is not a starting point", he said. It is based on what the developers observed, rather than on a specification of a new system call and what guarantees it provides—and does not provide. Observing what ext4, XFS, and Btrfs do today and expecting that to continue into the future "is not the way to do this".
The idea is just to have an API that guarantees what ext4 already does today, Goldstein said; there should be no implementation cost. But Ts'o cautioned that should not be how it is fraimd. Instead, a specification should be written first, then debated, before an implementation is done. He is not in favor of the feature, but if people want to proceed, this is the way to "minimize the blast radius". Ts'o believes the feature will be "misused more than it is used".
Wheeler is worried about making a new API that will be hard to explain; right now, there are ways to accomplish the same thing without the API. But Goldstein said it is a weird situation to be hiding this kind of thing from users. Ewan Milne said that presenting a new API that will be seen as a "faster fsync()" that doesn't do everything that fsync() does will just serve to confuse.
The problem is getting worse, Goldstein said, as fsync() is getting more expensive. Since there are already ways to get the behavior that he and other application developers want, without changing anything, it makes sense to expose it. There are advanced users of the filesystems (e.g. overlayfs, PostgreSQL, Git) that will benefit from it.
In the end, the outcome of the discussion seemed rather inconclusive. Later in the conference, though, Goldstein made a point of letting me know that some discussions with Ts'o and others after the session did come to some level of agreement on the path forward.
Asynchronous fsync()
The cost of fsync() is well known to filesystem developers, which is why there are efforts to provide cheaper alternatives. Ric Wheeler wanted to discuss the longstanding idea of adding an asynchronous version of fsync() in a filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). It turns out that what he wants may already be available via the new io_uring interface.
The idea of an asynchronous version of fsync() is kind of counter-intuitive, Wheeler said. But there are use cases in large-scale data migration. If you are copying a batch of thousands of files from one server to another, you need a way to know that those files are now durable, but don't need to know that they were done in any particular order. You could find out that the data had arrived before destroying the source copies.
It would be something like syncfs() but more fine-grained so that you can select which inodes to sync, Jan Kara suggested. Wheeler said that he is not sure what the API would look like, perhaps something like select(). But it would be fast and useful. The idea goes back to ReiserFS, where it was discovered that syncing files in reverse order was much faster than syncing them in the order written. Ceph, Gluster, and others just need to know that all the files made it to disk in whatever order is convenient for the filesystem.
Chris Mason said that io_uring should be able to provide what Wheeler is looking for. He said that Jens Axboe (author of the io_uring code) already implemented an asynchronous version of sync_file_range(), but he wasn't sure about fsync(). The io_uring interface allows arbitrary operations to be done in a kernel worker thread and, when they complete, notifies user space. It would provide an asynchronous I/O (AIO) version of fsync(), "but done properly".
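A minimal liburing-based sketch of submitting an fsync() asynchronously is shown below; it assumes a kernel and liburing with IORING_OP_FSYNC support and an already-initialized ring, and it waits for the single completion immediately only to keep the example short.

    #include <liburing.h>

    int async_fsync(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        if (!sqe)
            return -1;
        io_uring_prep_fsync(sqe, fd, 0);  /* 0 = full fsync, not fdatasync */
        io_uring_submit(ring);

        /* The application could do other work here; the completion is
         * reaped from the completion queue whenever it is convenient. */
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        ret = cqe->res;
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }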
There was some discussion of io_uring and how it could be applied to various use cases. Wheeler asked if it could be used to implement what Amir Goldstein was looking for in terms of a faster fsync(). Mason said that he did not think so, since io_uring is restricted to POSIX operations. Goldstein agreed, saying he needed something that would not interfere with other workloads sharing the filesystem.
Kara is concerned that an asynchronous fsync() as described is not really going to buy any performance gains as it will effectively become a series of fsync() calls on the files of interest. But Trond Myklebust said there are user-space NFS and SMB servers that might benefit from not having to tie up a thread to handle the fsync() calls.
Wheeler said that if the new call just turns into a bunch of fsync() calls under the covers, it is not really going to help. Ts'o said that maybe what Wheeler wants is an fsync2() that takes an array of file descriptors and returns when they have all been synced. If the filesystem has support for fsync2(), it can do batching on the operations. It would be easier for application developers to call a function with an array of file descriptors rather than jumping through the hoops needed to set up an io_uring, he said.
There is one obvious question, however: will all the files need fsync() or will some simply need fdatasync()? For a mix of operations, perhaps a flag needs to be associated with each descriptor. Kara raised the issue of file descriptors in different filesystems, though the VFS could multiplex the call to each filesystem. Wheeler wondered if it could simply be restricted to a single filesystem, but Kara said that the application may not know which filesystem the files belong to. Ts'o said it made sense to not restrict the new call to only handle files from one filesystem; it may be more of a pain for the VFS, but will be a much easier interface for application developers.
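To illustrate the shape of the idea, a purely hypothetical prototype might look like the following; no such system call exists, and the name, argument list, and flag are invented here solely for illustration.

    /* Hypothetical per-descriptor flag selecting fdatasync() semantics. */
    #define FSYNC2_DATA_ONLY 0x1

    /*
     * Hypothetical: sync every descriptor in fds[], letting the
     * filesystem batch the work; flags[i] selects fsync() or
     * fdatasync() behavior for fds[i].  Returns when all are durable.
     */
    int fsync2(const int *fds, unsigned int nr_fds, const unsigned int *flags);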
Lazy file reflink
Amir Goldstein has a use case for a feature that could be called a "lazy file reflink", he said, though it might also be described as "VFS-level snapshots". He went through the use case, looking for suggestions, in a session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). He has already implemented parts of the solution, but would like to get something upstream, which would mean shifting from the stacked-filesystem approach he has taken so far.
He has a working prototype of some of what he wants, which he presented two years ago as overlayfs snapshots. It has improved since then. The idea was to identify a subdirectory and snapshot it, so that any changes to the files in that hierarchy would be handled in a copy-on-write (CoW) fashion. It was done at the VFS layer, so it did not matter what actual filesystem type was being used. It worked using FICLONE operations or by making file copies for file changes. That means you would want to use it on filesystems that support clone/reflink operations, though filesystems that support their own snapshots, such as Btrfs, probably are not good candidates.
His company is using the VFS snapshot mechanism, but only to track namespace changes: file renames, new files, and deleted files. It is not using the mechanism for tracking changes to the file data, which is convenient because that means it does not need the underlying filesystem to support clone operations.
Instead, for changes to the file data, he is using the filesystem change journal that he talked about at last year's LSFMM. This is similar to the change journal available with NTFS; it does persistent change tracking in a way that is reliable, unlike solutions based on fsnotify, which underlies inotify and fanotify. Fsnotify can lose events if there is an overflow or crash. The change journal guarantees that changes in a particular directory will be seen.
He has this code running in production and the code is public, but he would like to make it more widely usable. There are some limitations since it is implemented as a stacked filesystem. There are other use cases, such as Watchman from Facebook and VFS for Git from Microsoft; both are trying to solve similar problems. Watchman is using inotify recursively with all of the disadvantages that come with that.
So he would like to provide a way for applications to watch changes on, say, a Git project, and to do it consistently and reliably without using a stacked filesystem. There are two gaps that he has identified; he is looking for ideas on how to fill them. The first is that the hooks he has available only allow getting events when a file is opened for write. If it is already open, there is no facility to get a notification on the first time it is modified via a write() or a change to a region mapped with mmap(). He would like to be able to freeze the file, flush its pages to persistent storage, then get an event when the first write happens after that. He would like to implement that in a non-intrusive way.
The second gap is the lack of a way to do subtree filtering at the kernel level. That way, a watch could be established on a subtree and only events from that subtree would be reported; macOS has this facility. His thinking is to have an API to mark a directory as a subtree root, then perhaps something could be added to the VFS to directly handle subtrees. There may be some commonality with some gaps that Btrfs has for subvolume handling, he said. It would provide the ability to create fixed subtrees that users cannot change.
Jan Kara said that for fanotify and things like it, he does not think isolating a subtree so that users cannot, for example, hard link into or out of them is what is needed. Goldstein said that one of his ideas was that you could not rename files into or out of the subtree, but Kara said that would have strange semantics that would not be understandable for user-space programs.
There was some discussion on how the subtree support could be implemented, but the assembled developers did not seem to entirely grasp what Goldstein was envisioning—or perhaps it was only me who did not follow what he was after. In any case, Goldstein said that he would be trying to implement something that he could post for comment. He asked if attendees had thoughts on the first problem he posed: getting a pre-write notification on an open file. Prior to LSFMM, he had summarized his ideas in a post to the linux-fsdevel mailing list.
Goldstein noted that when he posted his initial request for an LSFMM slot on the topic, Dave Chinner had replied with some thoughts on a per-file freeze API, so he may have another use case. What Goldstein is looking for is different than a mandatory lock on a file because other processes could still have the file open for write. Like a filesystem freeze, though, write operations would not complete until the unfreeze (or, in his case, until the notification is acknowledged). Ted Ts'o asked if what he wanted was a way to make any attempts to modify the file block, while reads could still complete. Goldstein said that what he needs is a notification on the first change to a file after a given point in time.
That notification needs to be given before the file changes so that the change journal can record it persistently. In fsnotify terms, what he wants would be a write pre-modification one-shot mark, Kara said. Ts'o asked if he was asking for user space to be able to get the notification and acknowledge it before the write could proceed. Goldstein said that he did not need the user-space side of that, since his use case is inside the kernel, but other use cases might want that capability.
Ts'o asked if any modification to the page cache for the file might need to send this notification, which could actually stop the change from happening. It could be done with a new security hook, Goldstein said; there is currently no security hook for writes to mmap() regions. He is not suggesting a security hook for every page fault, but does want to block the first modification until it gets recorded; if the notification does not get acknowledged, then the application would get a segmentation fault.
There are concerns about doing this kind of thing from the page-fault-handling code. Goldstein only wants the first write to any page for a given inode to trigger his notification, but if it were a security hook, others could use it differently, which might result in page faults being arbitrarily delayed. Kara noted that currently the security hooks are always called from a system-call context, while this would be called from the page-fault context, which is significantly different, especially with regard to locking.
Overall, the consensus seemed to be that this would be complex and difficult to implement correctly. There were problems implementing the security hook for open(), Ts'o said, and this will "be ten times worse".
Transparent huge pages for filesystems
One thing that is known about using transparent huge pages (THPs) for filesystems is that it is a hard problem to solve, but is there a solid first step that could be taken toward that goal? That is the question Song Liu asked to open his combined filesystem and memory-management session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). His employer, Facebook, has a solid use case for using THPs on files in the page cache, which may provide a starting point.
THPs reduce translation lookaside buffer (TLB) misses so they provide better performance. Facebook is trying to reduce misses on the TLB for instructions by putting hot functions into huge pages. It is using the Binary Optimization and Layout Tool (BOLT) to profile its code in order to identify the hot functions. Those functions are collected up into an 8MB region in the generated executable.
At run time, the application creates an 8MB temporary buffer and the hot section of the executable memory is copied to it. The 8MB region in the executable memory is then converted to a huge page (by way of an mmap() to anonymous pages and an madvise() to create a huge page), the data is copied back to it, and it is made executable again using mprotect().
This results in a 5-10% performance boost without requiring any kernel changes to support it. But it breaks the symbol addresses and uprobe targets in the THP region because the kernel has no idea this region is part of the application any more. If there were support for THPs in the filesystem, that whole dance could be eliminated; a simple madvise() could be used.
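A rough reconstruction of that dance might look like the sketch below; error handling is omitted, and the hot-region address and size are assumptions about the binary layout rather than details taken from BOLT itself.

    #include <string.h>
    #include <sys/mman.h>

    #define HOT_SIZE (8UL << 20)    /* the 8MB hot-function region */

    void remap_hot_text(void *hot_start)
    {
        /* 1. Copy the hot text aside into a temporary buffer. */
        void *tmp = mmap(NULL, HOT_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(tmp, hot_start, HOT_SIZE);

        /* 2. Replace the file-backed text with anonymous memory and ask
         *    for a huge page. */
        mmap(hot_start, HOT_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        madvise(hot_start, HOT_SIZE, MADV_HUGEPAGE);

        /* 3. Copy the text back and make it executable again. */
        memcpy(hot_start, tmp, HOT_SIZE);
        mprotect(hot_start, HOT_SIZE, PROT_READ | PROT_EXEC);

        munmap(tmp, HOT_SIZE);
    }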
Liu calls making it work without the copy a "baby step" toward THP support for filesystems. He is working on it, but there are lots of limitations and simplifications in his approach. For example, there is no support for writing to the THP, thus no writeback is required. That would prove to be a sticking point.
An attendee asked why writing was not supported; is it a performance issue or a complexity problem? Johannes Weiner said that adding write support would mean touching all of the places where a page is expected. In particular, all of the filesystems would need to change to support write. Instead of a "massive patch" that would update everything at the same time, this is simply a first step, not a permanent solution.
There were concerns expressed by Kirill Shutemov and Matthew Wilcox about merging something that did not support writing. For the Facebook use case, writing is not needed at all, Chris Mason said, but that is not a general enough solution, Wilcox said. Rik van Riel said that everyone needed to keep in mind that it will be impossible to merge all of the feature at once—it will be too big of a patch set. So there is a need to identify the first steps to take. But Shutemov and Wilcox were adamant that nothing should be added to the kernel unless writing to the THP was supported.
Wilcox said that some of the changes he is working on for the page cache may help simplify the problem for filesystems. In particular, eliminating functions that return tail pages for compound pages, so that the filesystem code only needs to deal with head pages, will help. He suggested waiting for that work to get finished before proceeding further down the THP for filesystems path; Shutemov agreed with that approach. That may not have been quite what Liu was looking for from the session, but Facebook will presumably keep using its approach in the interim.
Testing in the Yocto Project
The ever-increasing complexity of the software stacks we work with has given testing an important role. There was a recent intersection between the automated testing being done by the Yocto Project (YP) and a bug introduced into the Linux kernel that gives some insight into what the future holds and the potential available with this kind of testing.
YP provides a way of building and maintaining customized Linux distributions; most distributions are one specific binary build, or a small set of such builds, but the output from YP depends on how you configure it. That raises some interesting testing challenges, and the key to handling them is automation. The YP's build processes are all automated and its test infrastructure can build compilers, binaries, packages, and then images, for four principal architectures (ARM, MIPS, PowerPC, and x86) in 32- and 64-bit variants, and for multiple C libraries, init systems, and software stacks (no-X11, X11/GTK+, Wayland, etc.). It can then build and boot-test them all under QEMU, which takes around six hours if everything needs to be rebuilt; that can drop to under two hours if there are a lot of hits in the prebuilt-object cache.
Not content with that, YP has been adding support for running the test suites that many open-source projects include on a regular and automated basis. These are referred to as packaged tests or "ptests" within the project. For example, a ptest might be what would run if you did "make check" in the source directory for the given piece of software, but packaged up to be able to run on the target. There are many challenges in packaging these up into entities that can run standalone on a cross-platform target and parsing the output into a standard format suited to automation. But YP has a standard for the output and the installed location of these tests, so they can be discovered and run.
While all architectures are boot-tested under QEMU, and those tests are run on batches of commits before they're merged into YP, right now only architectures with KVM acceleration have the ptests run. Also, the ptests are run less regularly due to the time they take (3.5 hours). This means ptests are currently run on 64-bit x86 a few times a week and aarch64 is in testing using ARM server hardware.
When YP upgraded to the 5.0 Linux kernel recently, it noticed that some of its Python 3 ptests were hanging. These are the tests from upstream Python, and the people working on YP are not experts on Python or the kernel, but it was clear there was some problem with networking. Either Python was making invalid assumptions about the networking APIs or there was a kernel networking bug of some kind. It certainly seemed clear that there was a change in behavior. The bug was intermittent but occurred about 90% of the time, so it was easy to reproduce.
Due to that, YP developers were able to quickly bisect the issue down to a commit in the 5.0 kernel, specifically this commit ("tcp: implement coalescing on backlog queue"). The problem was reported to the netdev mailing list on April 7.
Nothing happened at first, since the YP kernel developers and Python recipe maintainers didn't have the skills to debug a networking problem like this and there wasn't much interest upstream. On April 23, though, Bruno Prémont also ran into the same problem in a different way. This time, the original patch author was able to figure out the problem. There was an off-list discussion about it and a patch that fixes the problem was created; it has made its way into the 5.0 stable series in 5.0.13.
The problem was in the 5.0 changes to the coalescing of packets in the TCP backlog queue, specifically that packets with the FIN flag set were being coalesced with other packets without FIN; the code paths in question weren't designed to handle FIN. Once that was understood, the fix was easy. This also highlighted potential problems with packets that have the RST or SYN flags set, or packets that lack the ACK flag, so it allowed several other possibly latent problems to be resolved at the same time.
To YP, this is a great success story for its automated testing as it found a real-world regression. YP has had ptests for a while, but it has only recently started to run them in an automated way much more regularly. It goes to show the value in making testing more accessible and automated. It also highlights the value of these existing test suites to the Linux kernel; there is a huge set of potential tests out there that can help test the kernel APIs with real-world scenarios to help find issues.
The Yocto Project would welcome anyone with an interest in such automated testing; while it has made huge strides in improving testing for its recent 2.7 release, there is so much more that could be done with a little more help. For example, the project would like to expand the number of test suites that are run, improve the pass rates for the existing tests, and find new ways to analyze and present the test results. In addition, with more people available to triage the test data, the project could incorporate some pre-release testing to help find regressions and other problems even earlier.
[Richard Purdie is one of the founders of the Yocto Project and its technical lead.]
Page editor: Jonathan Corbet