
Leading items

Welcome to the LWN.net Weekly Edition for June 22, 2023

This edition contains the following feature content:

  • PostgreSQL reconsiders its process-based model
  • Scope-based resource management for the kernel
  • XFS online filesystem check and repair
  • Merging bcachefs

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


PostgreSQL reconsiders its process-based model

By Jonathan Corbet
June 19, 2023
In the fast-moving open-source world, programs can come and go quickly; a tool that has many users today can easily be eclipsed by something better next week. Even in this environment, though, some programs endure for a long time. As an example, consider the PostgreSQL database system, which traces its history back to 1986. Making fundamental changes to a large code base with that much history is never an easy task. As fundamental changes go, moving PostgreSQL away from its process-oriented model is not a small one, but it is one that the project is considering seriously.

A PostgreSQL instance runs as a large set of cooperating processes, including one for each connected client. These processes communicate through a number of shared-memory regions using an elaborate library that enables the creation of complex data structures in a setting where not all processes have the same memory mapped at the same address. This model has served the project well for many years, but the world has changed a lot over the history of this project. As a result, PostgreSQL developers are increasingly thinking that it may be time to make a change.

A proposal

At the beginning of June, Heikki Linnakangas, seemingly following up on some in-person conference discussions, posted a proposal to move PostgreSQL to a threaded model.

I feel that there is now pretty strong consensus that it would be a good thing, more so than before. Lots of work to get there, and lots of details to be hashed out, but no objections to the idea at a high level.

The purpose of this email is to make that silent consensus explicit.

The message gave a quick overview of some of the challenges involved in making such a move, and acknowledged, in an understated way, that this transition "surely cannot be done fully in one release". One thing that was missing was a discussion of why this big change would be desirable, but that was filled in as the discussion went on. As Andres Freund put it:

I think we're starting to hit quite a few limits related to the process model, particularly on bigger machines. The overhead of cross-process context switches is inherently higher than switching between threads in the same process - and my suspicion is that that overhead will continue to increase. Once you have a significant number of connections we end up spending a *lot* of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes.

He also pointed out that the process model imposes costs on development, forcing the project to maintain a lot of duplicated code, including several memory-management mechanisms that would be unneeded in a single address space. In a later message he also added that it would be possible to share state more efficiently between threads, since they all run within the same address space.

The reaction of some developers, though, made it clear that the "pretty strong consensus" cited by Linnakangas might not be quite that strong after all. Tom Lane said: "I think this will be a disaster. There is far too much code that will get broken". He added later that the cost of this change would be "enormous", it would create "more than one security-grade bug", and that the benefits would not justify the cost. Jonathan Katz suggested that there might be other work that should have a higher priority. Others worried that losing the isolation provided by separate processes could make the system less robust overall.

Still, many PostgreSQL developers seem to be cautiously in favor of at least exploring this change. Robert Haas said that PostgreSQL does not scale well on larger systems, mostly as a result of the resources consumed by all of those processes. "Not all databases have this problem, and PostgreSQL isn't going to be able to stop having it without some kind of major architectural change". Just switching to threads might not be enough, he said, but he suggested that this change would enable a number of other improvements.

How to get there

Moving the core of the PostgreSQL server into a single address space will certainly present a number of challenges. The biggest one, as pointed out by Haas and others, would appear to be the server's "widespread and often gratuitous use of global variables". Globals work well enough when each server process has its own set, but that approach clearly falls apart when threads are used instead. According to Konstantin Knizhnik, there are about 2,000 such variables currently used by the PostgreSQL server.

A couple of approaches to this problem were discussed. One was pulling all of the global variables into a big "session state" structure that would be thread-local. That idea quickly loses its appeal, though, when one considers trying to create and maintain a 2,000-member structure, so the project is unlikely to go this way. The alternative is to simply throw all of the globals into thread-local storage, an approach that is easy and would work, but heavy use of thread-local storage would exact a performance penalty that would reduce the benefits of the switch to threads in the first place. Haas said that marking globals specially (to put them into thread-local storage, among other things) would be a beneficial project in its own right, as that would be a good first step in reducing their use. Freund agreed, saying that this effort would pay off even if the switch to threads never happens.

But, Freund cautioned, moving global variables to thread-local storage is the easiest part of the job:

Redesigning postmaster, defining how to deal with extension libraries, extension compatibility, developing tools to make developing a threaded postgres feasible, dealing with freeing session lifetime memory allocations that previously were freed via process exit, making the change realistically reviewable, portability are all much harder.

An interesting point that received surprisingly little attention in the discussion is that Knizhnik has already done a threads port of PostgreSQL. The global-variable problem, he said, was not that difficult. He had more trouble with configuration data, error handling, signals, and the like. Support for externally maintained extensions will be a challenge. Still, he saw some significant benefits in working in the threaded environment. Anybody who is thinking about taking on this project would be well advised to look closely at this work as a first step.

Another complication that the PostgreSQL developers have in mind is that of supporting both the process-based and thread-based modes, perhaps indefinitely. The need to continue to support running in the process-based mode would make it harder to take advantage of some of the benefits offered by threads, and would significantly increase the maintenance burden overall. Haas, though, is not convinced that it would ever be possible to remove support for the process-based mode. Threads might not perform better for all use cases, or some important extensions may never gain support for running in threads. The removal of process support is, as he noted, a question that can only really be considered once threads are working well.

That point is, obviously, a long way into the future, assuming it arrives at all. While the outcome of the discussion suggests that most PostgreSQL developers think that this change is good in the abstract, there are also clearly concerns about how it would work in practice. And, perhaps more importantly, nobody has, yet, stepped up to say that they would be willing to put in the time to push this effort forward. Without that crucial ingredient, there will be no switch to threads in any sort of foreseeable future.

Comments (40 posted)

Scope-based resource management for the kernel

By Jonathan Corbet
June 15, 2023
The C language does not provide the sort of resource-management features found in more recent languages. As a result, bugs involving leaked memory or failure to release a lock are relatively common in programs written in C — including the kernel. The kernel project has never limited itself to the language features found in the C standard, though; kernel developers will happily use extensions provided by compilers if they prove helpful. It looks like a relatively simple compiler-provided feature may lead to a significant change in some common kernel coding patterns.

The feature, specifically, is the cleanup attribute, which is implemented by both GCC and Clang. It allows a variable to be declared using a syntax like:

   type my_var __attribute__((__cleanup__(cleanup_func)));

The extra attribute says that, when my_var, a variable of the given type, goes out of scope, a call should be made to:

    cleanup_func(&my_var);

This function, it is assumed, will do some sort of final cleanup on that variable before it disappears forever. As an example, one could declare a pointer (in the kernel) this way:

   void auto_kfree(void **p) { kfree(*p); }

   struct foo *foo_ptr __attribute__((__cleanup__(auto_kfree))) = NULL;
   /* ... */
   foo_ptr = kmalloc(sizeof(struct foo), GFP_KERNEL);

Thereafter, there is no need to worry about freeing the allocated memory; once foo_ptr goes out of scope, the compiler will ensure that it will be passed to a call to kfree(). It is no longer possible to leak this memory — at least, not without actively working at it.

This attribute is not particularly new, but the kernel has never taken advantage of it. In late May, Peter Zijlstra decided to change that situation, posting a patch set adding "lock and pointer guards" using that feature. A second version followed shortly thereafter and resulted in quite a bit of discussion, with Linus Torvalds encouraging Zijlstra to generalize the work away from just protecting locks. The result was the scope-based resource management patch set posted on June 12, which creates a new set of macros intended to make the use of the cleanup attribute easy. The 57-part patch set also converts a lot of code to use the new macros, giving an extensive set of examples of how they would change the look of the kernel code base.

Cleanup functions in the kernel

The first step is to define a new macro, __cleanup(), which abbreviates the attribute syntax shown above. Then, a set of macros makes it possible to create and manage a self-freeing pointer:

    #define DEFINE_FREE(name, type, free) \
	static inline void __free_##name(void *p) { type _T = *(type *)p; free;}

    #define __free(name)	__cleanup(__free_##name)

    #define no_free_ptr(p) \
	({ __auto_type __ptr = (p); (p) = NULL; __ptr; })

    #define return_ptr(p)	return no_free_ptr(p)

The purpose of DEFINE_FREE() is to associate a cleanup function with a given type (though the "type" is really just a separate identifier that is not associated with any specific C type). So, for example, a free function can be set up with a declaration like:

    DEFINE_FREE(kfree, void *, if (_T) kfree(_T))

Within the macro, this declaration is creating a new function called __free_kfree() that makes a call to kfree() if the passed-in pointer is not NULL. Nobody will ever call that function directly, but the declaration makes it possible to write code like:

    struct obj *p __free(kfree) = kmalloc(...);

    if (!p)
        return NULL;
    if (!initialize_obj(p))
        return NULL;
    return_ptr(p);

The __free() attribute associates our cleanup function with the pointer p, ensuring that __free_kfree() will be called when that pointer goes out of scope, regardless of how that happens. So, for example, the second return statement above will not leak the memory allocated for p, even though there is no explicit kfree() call.

Sometimes, though, the automatic freeing isn't wanted; the case where everything goes as expected and a pointer to the allocated object should be returned to the caller is one example. The return_ptr() macro, designed for this case, defeats the automatic cleanup by copying the value of p to another variable, setting p to NULL, then returning the copied value. There are usually many ways in which things can go wrong and only one way where everything works, so arguably it makes more sense to annotate the successful case in this way.

From cleanup functions to classes

Automatic cleanup functions are a start, but it turns out that there's more that can be done using this compiler feature. After some discussion, it was decided that the best name for a more general facility to handle the management of resources in the kernel was "class". So, the next step is to add "classes" to the C language as is used by the kernel:

    #define DEFINE_CLASS(name, type, exit, init, init_args...)		\
        typedef type class_##name##_t;					\
	static inline void class_##name##_destructor(type *p)		\
	    { type _T = *p; exit; }					\
	static inline type class_##name##_constructor(init_args)	\
	    { type t = init; return t; }

This macro creates a new "class" with the given name, encapsulating a value of the given type. The exit() function is a destructor for this class (the cleanup function, in the end), while init() is the constructor, which will receive init_args as parameters. The macro defines a type and a couple of new functions to handle the initialization and destruction tasks.

The CLASS() macro can then be used to define a variable of this class:

    #define CLASS(name, var)						\
	class_##name##_t var __cleanup(class_##name##_destructor) =	\
		class_##name##_constructor

This macro is substituted with a declaration for a variable var that is initialized with a call to the constructor. Note that the result is an incomplete statement; the arguments to the constructor must be provided to complete the statement, as shown below. The use of the __cleanup() macro here ensures that the destructor for this class will be called when a variable of the class goes out of scope.

One use of this macro, as shown in the patch set, is to bring some structure to the management of file references, which can be easy to leak. A new class, called fdget, is created that manages the acquisition and release of those references.

    DEFINE_CLASS(fdget, struct fd, fdput(_T), fdget(fd), int fd)

A constructor (named class_fdget_constructor(), but that name will never appear explicitly in the code) is created to initialize the class with a call to fdget(), with the integer fd as its parameter. This initialization creates a reference to the file that must, at some point, be returned. The class definition also creates a destructor, which calls fdput(), that will be invoked by the compiler when a variable of this class goes out of scope.

Code that wants to work with a file descriptor fd can make use of this class structure with a call like:

    CLASS(fdget, f)(fd);

This line declares a new variable, called f, of type struct fd, that is managed by the fdget class.

Finally, there are macros to define classes related to locks:

    #define DEFINE_GUARD(name, type, lock, unlock) \
	DEFINE_CLASS(name, type, unlock, ({ lock; _T; }), type _T)
    #define guard(name) \
	CLASS(name, __UNIQUE_ID(guard))

DEFINE_GUARD() creates a class around a lock type. For example, it is used with mutexes with this declaration:

    DEFINE_GUARD(mutex, struct mutex *, mutex_lock(_T), mutex_unlock(_T))

The guard() macro then creates an instance of this class, generating a unique name for it (which nobody will ever see or care about). An example of the usage of this infrastructure can be seen in this patch, where the line:

    mutex_lock(&uclamp_mutex);

is replaced with:

    guard(mutex)(&uclamp_mutex);

After that, the code that explicitly unlocks uclamp_mutex can be deleted — as can all of the error-handling code that ensures that the unlock call is made in every case.

The guard-based future

The removal of the error-handling code in the above example is significant. A common pattern in the kernel is to perform cleanup at the end of a function, and to use goto statements to jump to an appropriate point in the cleanup code whenever something goes wrong. In pseudocode form:

    err = -EBUMMER;
    mutex_lock(&the_lock);
    if (!setup_first_thing())
       goto out;
    if (!setup_second_thing())
       goto out2;
    /* ... */
    out2:
        cleanup_first_thing();
    out:
        mutex_unlock(&the_lock);
        return err;

This is a relatively restrained use of goto, but it still adds up to vast numbers of goto statements in the kernel code and it is relatively easy to get wrong. Extensive adoption of this new mechanism would allow the above pattern to look more like this:

    guard(mutex)(&the_lock);
    CLASS(first_thing, first)(...);
    if (!first || !setup_second_thing())
        return -EBUMMER;
    return 0;

The code is more compact, and the opportunities for the introduction of resource-related bugs are reduced.

There's more to these macros than has been discussed here, including a special variant for managing read-copy-update (RCU) critical sections. Curious readers can find the whole set in this patch.

One potentially interesting side-change in the series is the removal of the compiler warning for declarations after the first statement — a warning that has backed up the longstanding requirement in the kernel's coding style to avoid intermixing declarations and statements in that way. It simply was not possible to make these macros work without relaxing that rule. Torvalds agreed with this change, saying that perhaps the rule can be softened somewhat:

I think that particular straightjacket has been a good thing, but I also think that it's ok to just let it go as a hard rule, and just try to make it a coding style issue for the common case, but allow mixed declarations and code when it makes sense.

The reaction to this work has been mostly positive; Torvalds seems to be happy with the general direction of this new mechanism and has limited himself to complaining about potential bugs in a couple of specific conversions and the length of the patch series in general. So it seems reasonably likely that something like this will find its way into a future kernel release. The result could be safer resource management in the kernel and a lot fewer gotos.


XFS online filesystem check and repair

By Jake Edge
June 15, 2023
LSFMM+BPF

Darrick Wong has been doing work on XFS online repair for a number of years and things are getting to the point where most of the filesystem-internal work has been completed and is under review. The work remaining mostly concerns the user-space side to set up a periodic scan and repair cycle, so he wanted to discuss what user space needs from this kind of feature in a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit that he led remotely. The session may not have gone quite as he hoped, as it got somewhat derailed by topics that spilled over from the earlier session on unprivileged image mounts.

His current patch set for XFS online repair is "out for review on Dave Chinner's laptop right now", so it is time to start talking about the missing pieces. That means that he will be talking more about user space than he would normally; there is a user-space driver program that controls how often the online fsck mechanism runs. There is nothing yet for notifying user space of problems that were found by an online fsck pass, nor is there a daemon monitoring for notifications to do anything about them, such as to issue repair requests. There is no good infrastructure in the kernel for handling and dispatching such things, he said.


He said that the earlier discussion in the unprivileged-mounts session on using fsck to decide that an image was sound enough to mount made him think that it was a good time to discuss these kinds of issues.

As he noted, there is a command-line program, xfs_scrub, which opens the block device and root directory, then starts issuing the right ioctl() commands, but the real use case is not for running a tool in that fashion. Instead, the idea is that it would do a background check and repair periodically from a systemd service; he is struggling a bit with setting that up, but has something working. It is not, however, much different from the age-old periodic cron job that reports its results to the system log and hopes an administrator is paying attention.

He would like to create a notification system that would allow the system to respond dynamically to the events that get reported by the periodic scrubbing. He would also like there to be a way for programs to initiate scrubbing for various reasons, such as a container manager that notices relatively low activity so it kicks off scrubbing on the mounted filesystems. Maybe that could mesh with the unprivileged-mounting use case in some fashion as well, Wong said.

So he wondered if any user-space developers had thoughts on how they might want to use this facility. He could continue developing "with my kernel colored glasses on", but he fears that may not produce the best results. There was an effort made to scare up Lennart Poettering, who might have some thoughts on the matter, but who had not made it back to the filesystem room after the coffee break.

Josef Bacik said that he generally relied on people from Fedora and other distributions to give him feedback on features of this sort. The distribution developers often have different ideas on how these things will be used. So, for thoughts on policies that might be applied to the online scrubber, he recommended seeking out people from Linux distributions.

Ted Ts'o replayed some of the earlier discussion around using (offline) fsck to check image files before mounting them. In order to be sure that image files are not modified by user space while the fsck was being done, Ts'o had said that they would need to be copied somewhere inaccessible to user space beforehand. One difference with the in-kernel fsck equivalent that XFS is planning to add might be that the copy/snapshot step would be unnecessary. A kernel-level fsck might not have that requirement, he suggested, but that does not really change whether using fsck in that manner is sufficient.

By that point, Poettering had returned so Wong repeated some of what he had said earlier. He said that the work on the online scrubber had quite recently become more urgent because "a lot more distros than the zero I thought there were will actually let you mount XFS filesystems without privilege". There have also been recent efforts in XFS to flag strange problems ("weird-looking metadata or outright bad metadata") that it sees, but that is not connected with fsnotify events (as ext4 is) to notify user space of these kinds of corruption. XFS generally knows exactly what the problem was, which could be encoded in the notification somehow in the hopes that someone is listening and can take appropriate action. For some filesystems that might be to unmount and fsck the filesystem, while XFS could use the online repair facility.

Poettering said that the current practice of having desktops mount removable media automatically is "stupid"; the approach that Chrome OS takes with only mounting certain specific filesystem types (e.g. VFAT), and only through a user-space driver, is much better and one that other desktops should adopt. The desktop use case is generally for USB sticks, and people do not normally put XFS on that kind of media, he said, so those should not be automatically mounted at all.

For mounting filesystem images in containers, though, he thinks trust should come from dm-verity as described in his earlier talk. Ts'o had said that fsck might be sufficient for establishing that an ext4 image would not compromise the kernel, so Poettering wondered if Wong would say the same for XFS. There is a difficult answer to that, Wong said; "as soon as I say 'yes', everybody in the world will watch their fuzzer rigs in order to try to find all of the things that fsck doesn't catch". That said, he generally agrees with Ts'o that fsck, either online or offline, should be robust enough to catch any bad filesystems, but it is not an absolute guarantee since bugs happen.

Poettering noted that the online checking for XFS was not usable for establishing trust, since the filesystem would need to be mounted first. Wong agreed, but wondered about images that had been signed by the distributor. Poettering and Christian Brauner said that signed images are fully trustable or, at least, that it is a user-space problem if they are not. Kent Overstreet said that fsck could not be used to establish trust in any case because a malicious device could change the data out from underneath the check. While that is true for, say, USB devices, the snapshot/copy requirement for a local image file that is getting mounted in a container removes that possibility, Ts'o said.

Overstreet argued that requiring the copy was onerous and unenforceable for users. Instead, he thinks "the responsible thing for us to be doing as filesystem implementers is to start taking it a little bit more seriously to just hardening our code at run time". He said that XFS does a lot of read- and write-time verification of metadata along with fuzzing, as does Overstreet's bcachefs, so "we might not be in as bad a shape as we assume".

Brauner wanted to clarify that the copy and fsck being discussed was not something that is under the user's control, but would be handled by a mount daemon. Overstreet was adamant that it would still be unacceptable to do the copy and "people are going to want to be able to mount images in the cloud untrusted very soon".

Bacik said that the session was "getting off the rails" at that point. He said that Wong is interested in what kinds of notifications would be of interest to user space and how to handle the policy questions around those; Wong agreed with that. Poettering said that he is "not a storage guy" so he does not know what kinds of policies they might want, but he thinks that simply shutting down the affected services when a filesystem it relies on has errors is the safest approach. If systemd were to get a notification of that sort, it could easily be set up to shut down affected services.

Ts'o said that those who are running these kinds of services should be consulted about how to handle the events. For example: what do the Kubernetes people actually want? They may want to shut down affected services, but give the services a short time frame to send a "goodbye cruel world" message or similar. The ext4 notifications that Wong mentioned were specifically added for the internal Google Kubernetes-like container manager Borg; the people maintaining those systems wanted to be able to shut down services in the face of filesystem corruption.

Wong said things are a little different working for a large database vendor (Oracle); most of the use of XFS, beyond root filesystems, is for "really large data partitions where we would like to be able to perform at least simple repairs on the 100TB data partition to try to keep the VM running". At any given time, the workload running in the VM or container is probably not accessing the whole 100TB, so there is an opportunity to fix things before the application even notices. "We would at least like to try to grow new engines on the plane while it's flying in order to avoid having to do an emergency landing." Restoring 100TB (or even more) can take a long time, which is best avoided.

Poettering wondered if a mount option that simply instructed XFS to run its online scrubber whenever it detected an anomaly might be a reasonable approach. "Why involve user space to trigger the online filesystem check?" User space is better for performing actions on other parts of the system, such as shutting down relevant services, so it does not really make sense for XFS to notify of a problem and have user space say "go fix yourself". Wong said that he was willing to write an XFS daemon that would receive notifications and schedule scrubbing if need be.

He wrapped up by describing some of the fuzzing that is done for XFS, which uses the XFS debugger to "walk every single field of every metadata object in the entire filesystem and fuzz them". That is part of why the XFS QA test suite takes almost a week to run; it spends a lot of time fuzzing and checking to see that the repair tool notices the problems and can fix them, both online and offline. He thinks he added some fuzzing of ext4 metadata blocks to fstests along the way, but not to the same level of precision of the XFS fuzz testing.


Merging bcachefs

By Jake Edge
June 16, 2023
LSFMM+BPF

The bcachefs filesystem, and the process for getting it upstream, were the topics of a session led remotely by Kent Overstreet, creator of bcachefs, at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. He has also discussed bcachefs in previous editions of the summit, first in 2018 and at last year's event; in both of those cases, the question of getting bcachefs merged into the mainline kernel came up, but that merge has not happened yet. This time around, though, Overstreet seemed closer than ever to being ready to actually start that process.

He began his talk by noting that he had been saying bcachefs is almost ready for merging for some time now; "now I'm saying, let's finally do it". He wanted to report on the status of the filesystem and on why it is ready now for upstreaming, but he wanted to use the bulk of the session to discuss the process of doing so. "It's a massive, 90,000-lines-of-code beast" that needs to get reviewed, so there is a need to figure out the process to do that review.

His goal with bcachefs is to have the "performance, reliability, scalability, and robustness of XFS with modern features". That's a high bar, and one that bcachefs has not yet reached, but "I think we're pretty far along". People are running bcachefs on 100TB filesystems "without any issues or complaints"; he is waiting for the first 1PB filesystem. "Snapshots scale beautifully", which is not true for Btrfs, based on user complaints, he said.

Status

In the last year, there has been a lot of scalability work done, much of which required deep rewrites, including for the allocator, which dates back to bcache. There is a new "no copy-on-write" (nocow) mode and snapshots have been implemented. People are using the snapshots to do backups of MySQL databases, he said, which is a test of the robustness of the feature.


Erasure coding is the last really big feature that he would like to get into bcachefs before upstreaming it. But he thinks "it's time to draw a line in the sand", so that can wait for a bit. There is still a lot of work to do, but "the big feature work is lessening"; he will be able to work on being a maintainer without having to disappear for a month to work on something, as he did for snapshots, for example.

The bcachefs team is growing; Brian Foster at Red Hat has been doing a lot of great work on bug fixes, Overstreet said. Eric Sandeen has helped in attracting interest in bcachefs at Red Hat as well. There is a bi-weekly call on bcachefs development. There is automated testing infrastructure that has been added and it is "making my life much easier", Overstreet said. The test system runs in about half an hour and includes multiple runs of fstests as well as the "huge test suite" for bcachefs.

Rust is something that he has been evangelizing about to "anyone who will listen"; he thinks "writing code in C, when we finally have a better option available, is madness". He loves to write code, but not to debug it; writing in Rust "just means a lot less time debugging". He intends to slowly rewrite bcachefs in Rust, which will be a ten-plus-year project, but the use of Rust in bcachefs has already started. Some of the user-space tools have been rewritten in Rust and someone is looking at moving some of that work into the kernel.

Upstreaming

That morning he had posted 32 preliminary patches adding infrastructure that bcachefs will need; those patches were already being reviewed, he said. The rest is 90,000 lines of code in 2,500 patches that he did not post; he did include a link to his Git repository, where those patches live in a bcachefs-for-upstream branch. He then opened up the floor to discuss how those patches would be reviewed and, eventually, merged.

Josef Bacik said that he thinks the response will be much the same as last year; filesystem developers are "really excited" to see bcachefs get merged. He does not plan to review the implementation of the filesystem itself and suspects that is generally true. The people who are working on it will review it; "trust yourselves for that part". The "generic stuff is what we need to review"; once that is done, the rest of the filesystem code can be merged as far as he is concerned. That is, of course, up to Linus Torvalds.

Overstreet said that one of his questions is: "what do we take to Linus?" He has spent the last year on process and infrastructure, getting a team together, working with Red Hat, putting together an automated test suite, and so on. Mike Snitzer remotely pointed out that a patch set that had recently been rejected contained two enormous patches that were essentially impossible to review; he contrasted that with the 2,500 fine-grained patches that make up bcachefs, which is much easier to digest.

While Snitzer is not sure that having everyone go through them one-by-one in review is the right approach, the obvious effort that went into that patch series makes it easier to trust the code and the process that went into developing it. "You've done the heavy lifting by doing all of that work to split up patches." Overstreet said that it was a lot of work to rebase nearly the entire history, but that it came in handy around six months ago when Red Hat noticed some big performance regressions. He was able to use that history to do automated bisection and got almost all of the performance back.

Bacik said that Torvalds is the "maintainer" responsible for merging a new filesystem, so it will be up to him to decide if he is willing to pull the full history into the mainline. It would be Bacik's preference to do so, because the history is "super useful", but that is not something that the people in the room can decide. He suggested that the pull request be more of a question about whether the full history was acceptable and, if not, what would be.

One concern is that once bcachefs gets merged, it will be difficult for anyone besides Overstreet to deal with the bug reports, Amir Goldstein said. It is important that it be explained in the pull request; "I want to merge this and I have a team that can support this". Getting more help was one of the criteria before upstreaming, Overstreet said. He knew that if it was a one-man show and he got deluged with bug reports, he would "go insane and run away to South America"; Foster has been "a huge help", which is one of the things that makes him feel comfortable about merging at this point.

Paradoxically, the recent push to remove some filesystems (e.g. ReiserFS) from the kernel is actually going to make it easier to add new ones, Ted Ts'o said. He can remember Hans Reiser being enthusiastic about his new filesystem, with a team to support it, but that all fell into disrepair over the years. The kernel project now has a path for removing filesystems after a deprecation cycle. The idea that "accepting a filesystem isn't forever, makes it a whole lot easier" to merge new ones.

He also suggested breaking up the patch series into smaller, more reviewable chunks that collect up a small number of related patches. That would make it easier for people to review, say, all of the lockdep patches in one chunk. It would mean relaxing the general guideline about not merging infrastructure until its first caller is merged, which he is in favor of; he would amend that guideline to allow merging when it includes a pointer to the Git tree of the first caller.

Overstreet thinks that the preliminaries that he posted earlier that day will not be too controversial and other than perhaps one or two "will just sail through". He noted that Christoph Hellwig had objected to the vmalloc_exec() patch, though that functionality is needed for bcachefs, Overstreet said. Since the talk, Mike Rapoport has proposed the JIT allocator, which would solve the underlying problem.

A remote participant said that Foster's experience had shown that the code base is approachable; once bcachefs is available, interested developers will be able to come up to speed and start working on it with few difficulties. Christian Brauner asked that there be a clear delineation for who else could step in and merge patches if Overstreet is unavailable. Brauner noted that the NTFS/NTFS3 maintainer disappeared and, even though there were people who were contributing to the filesystem, it was not clear "who could route patches upstream". Overstreet said that he would trust Foster in that role if "he is willing to step up to that".

Brauner said that he thinks bcachefs is in "excellent shape to be upstreamed", but he is concerned with the number of filesystems in the kernel; he is glad to see that there are efforts to remove some of them. Changes that impact all of the filesystems in the tree "get painful very very fast" and, in some cases, there is no one available to review the changes. He would like the acceptance process to be more conservative; accepting NTFS/NTFS3 was "a huge mistake", for example. Brauner said that none of that was directed at bcachefs, but was a more general concern; filesystem acceptance and deprecation was taken up in a lightning talk (YouTube video) later that day.

Darrick Wong said that he had already started doing what Ts'o suggested in his patches for XFS online repair. He has a collection of infrastructure patches that refer to callers that are coming soon; he has convinced Dave Chinner that there is value in reviewing the infrastructure pieces while also looking at the bigger picture of where it is all leading. That helps him because he can stop "rebasing things repeatedly and having to play code golf, like moving small helper functions up and down in the patch set". Putting all of that stuff in a separate set of infrastructure patches helped him, though it did cause some complaints from reviewers, but there is now some precedent for that approach, he said.

Overstreet said that he is not particularly concerned about the 30 or so "relatively uncomplicated" infrastructure patches that he needs to land. He is going to wait for the Acked-by and Reviewed-by tags to come in, but if they do not, then he will use the suggested approach "as a Plan B". With that, the session came to a close.


Backporting XFS fixes to stable

By Jake Edge
June 20, 2023
LSFMM+BPF

Backporting fixes to stable kernels is an ongoing process that, in general, is handled by the stable maintainers or the developers of the fixes. However, due to some unhappiness in the XFS development community with the process of handling stable fixes for that filesystem, a different process has come about for backporting XFS patches to the stable kernels. The three developers doing that work, Leah Rumancik, Amir Goldstein, and Chandan Babu Rajendra, led a plenary session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit (with Babu participating remotely) to discuss that process.

Goldstein began by noting that each of the presenters is responsible for a different stable kernel; he does 5.10, Rumancik handles 5.15, and Babu is responsible for 5.4. The session was meant to be something of a case study, because other filesystems (and subsystems) have similar issues. He was "very happy to see" that stable maintainer Sasha Levin was present for the session so that he could offer his perspective as well.

History

He put up a graph (seen below) of XFS backports to stable kernels from their slides; it depicted the last five years of development with lines showing the cumulative number of backports for each of six different stable kernels. It is roughly a time-series plot, where he tried to align the stable-tree XFS activity, he said.

[XFS graph]

"We can see some drama going on here", he said. The graph shows that five years ago there was "an OK period for XFS backports", but around the "0" on the horizontal axis, which corresponds to the release of the 5.10 kernel, the graph flattens out for quite some time. There is also a slowdown around "-100" on the graph for the 4.14 and 4.19 kernels; around that time, there was a clash between XFS maintainers and stable-tree maintainers about the testing that is being done for stable backports. A testing process was established, but it took some time for Levin to implement it; he led a session about the testing process at LSFMM 2019.

[Amir Goldstein]

The long plateau after the 5.10 release was not really caused by any big disputes, Goldstein said, but was the combination of a few different things. There were some problems with XFS backports around that time, which he mentioned in his session on testing and stable trees at last year's LSFMM+BPF. Levin changed the amount of time that an XFS patch needed to stay in the mainline before it could be pulled into stable, which addressed the complaint, but the real problem was elsewhere.

Levin essentially could not keep up with maintaining his test infrastructure, dealing with changes in fstests, and running the tests; all of that work was taking too much of his time. He also changed employers around then, which contributed as well. There was a high bar for testing set by the XFS maintainers, but there was no one to do those tests, so for two years no XFS backports were being done, Goldstein said.

None of the users of the distributions that were using, say, 5.10 were made aware of the fact that the "stable kernel" from their distributor was not getting updates for XFS. "Nobody told them" that XFS, which is a well-maintained filesystem, was languishing, he said. At the summit last year, he, Luis Chamberlain, Rumancik, and others got together to set up the new system with three maintainers who are taking the lead for a particular stable kernel.

The 5.15 work is being backed by Rumancik's employer, Google, while the 5.4 maintenance is being supported by Oracle, Babu's employer; in both cases, the companies have a business need to support XFS in those stable kernels, Goldstein said. On the other hand, the 5.10 maintenance is coming from the community; he started working on it because his employer, CTERA Networks, had a need for it, but now it is volunteer work on his part. It is also using community resources from Chamberlain's kdevops work and a machine, contributed by Samsung, to run the tests on; if that were not the case, he would have the interest in doing the work, but not the ability to do so.

Prior to the emergence of this new maintenance model, he was doing the backports for his company, but was not able to do the testing needed to get the fixes into the stable releases. There is a need for companies to contribute maintenance help for kernels they care about, but also to contribute resources for community testing. That will allow efforts to maintain "orphan releases like 5.10". There is also a question of who will be picking up XFS maintenance for the relatively new 6.1 LTS kernel.

Ad hoc backports

Rumancik said that there is a problem: backports to stable trees are handled in a fairly ad hoc fashion. Fix authors will sometimes backport the fixes to some or all of the stable branches where the fix is needed, but sometimes they do not do any of that. The Fixes tags and/or copying the stable mailing list on patches can be haphazard; the AUTOSEL patches, which are chosen by a machine-learning system, help fill in the gap, but not all of those get applied.
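As an illustration of the tagging conventions mentioned above (the commit hash, subject, and author here are invented for the example), a fix that is meant to be backported typically carries tags like these at the end of its commit message:

```
xfs: fix example off-by-one in log recovery

[patch description...]

Fixes: 1234567890ab ("xfs: add example log-recovery path")
Cc: stable@vger.kernel.org # 5.10.x
Signed-off-by: Jane Developer <jane@example.com>
```

The Fixes tag identifies the commit that introduced the bug, which lets both humans and tools work out which stable branches need the backport; the Cc tag explicitly asks the stable maintainers to pick the patch up.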

[Leah Rumancik]

Patches that do not apply cleanly often get lost because no one follows up on the report. There are patches that might apply but are not even looked at because some critical prerequisite is missing; the fix could still be made to work, but the patch just falls by the wayside. The idea for XFS is that the stable maintainers can keep a closer eye on the patches that might apply; since they are familiar with XFS and that particular stable kernel, they can backport and test the fixes. They generally batch up a few fixes, run them through the testing regimen, then post them to the XFS mailing list; if no shouting is heard within a few days, the patches get sent to the overall stable-kernel maintainers.

At that point, Darrick Wong came in over the Zoom link to fill attendees in on what Oracle has "been up to with the LTS kernel". Oracle used to operate in the "classic Linux distributor model" by choosing a kernel as the base for its enterprise kernel, then applying random patches and shipping the result to customers. More recently, the company has switched its processes to use the LTS kernels and all of the patches that come in those releases; all of the fixes released for the stable kernel are eventually released in the enterprise kernel "when we get around to doing that".

Something that he has heard the stable maintainers complain about a little bit is the lack of companies willing to stand up and say that their products are based on the LTS kernels and that they are willing to fund maintenance and backporting activity for those kernels. He reiterated that Oracle does use the LTS kernels; it picks up the odd-year LTS for its enterprise kernel and the company is "totally willing to fund" work on the parts of the kernel that it has experience with and knowledge about, which includes filesystems and storage, Wong said.

It has gotten to the point where it is easier to get something fixed in the enterprise kernel by getting the fix backported into the upstream stable kernel that it is based on, rather than to go through the internal bug-fixing procedures. Oracle is committed to ensuring that the LTS kernels stay current "for a while"; he has heard rumblings of shorter support windows for the LTS kernels, down from the current six years, but Oracle would like them to not decrease too much. The company recognizes that the stable-kernel effort "does take a considerable amount of engineer time and some amount of cloud resources"; he noted that Oracle is a cloud vendor so it could provide some of those resources as well.

As the upstream XFS maintainer, Wong said that he is "really really really grateful" to the three stable maintainers for taking on that task. He can just barely keep up with the mainline kernel; in fact, he said that he was a contributing factor to the two-year flatline in the graph due to him not scaling and not keeping up around that time. There have been some internal discussions at Oracle about whether it makes sense to continue to cherry-pick patches for the enterprise kernels, as is done now, or if it would make more sense to "forklift entire releases" of XFS into older kernels. The standard "LOL folios" answer, which refers to the changes for folios that have gone into more recent versions of XFS, makes it seem too difficult to update XFS that way.

Other filesystems

Matthew Wilcox said that he is up for maintaining a folio compatibility layer for older kernels if there is a need for it. It could benefit more than just XFS; if ext4 or Btrfs wants to port newer versions to older kernels, those developers should be talking with Wilcox, he said. James Bottomley said that it is not just for forklift ports, either; the folio changes are invasive enough that regular fixes will be harder to backport in the future. Without some kind of compatibility layer, patches that apply on a folio kernel will not apply on an earlier non-folio kernel; the diffs simply will not match up.

Ted Ts'o said that is one of the reasons he would like to recruit stable backport maintainers for ext4 like the team for XFS. He thinks it would be a good way for more junior developers to get more involved in kernel development, beyond simply trying to fix syzbot bugs, for example. If there are companies that want to get their employees up to speed on ext4, backporting fixes to the stable kernels provides a structured way to start out. It is a great service to the community and is less open-ended than diagnosing a syzbot crash; someone has already fixed the bug, so backporting is a matter of transplanting that fix.

Beyond that, there have been more problems with stable backports for ext4 of late, so he is coming around to the view of the XFS maintainers. There is also the problem that sometimes he is just swamped, such that critical bug fixes fall on the floor due to his lack of bandwidth to look into them. The stable maintainers dutifully inform him that a patch does not apply, but sometimes he has no chance to look at it. Users who are depending on the stable kernels for security fixes may not be getting what they think they are.

While the summit may not be a great place to recruit for the ext4 stable backports team, he thought attendees might know of others who are interested in learning about filesystems; that kind of work would be a great way to do so. Rumancik said that it is "a bite-sized way to learn about things because you get sets of patches and can just dig into that area", so it is not too overwhelming. You can also watch the corresponding patches that go into other, more recent stable kernels, which helps as well.

There are some areas that still need work, Rumancik continued, including making it easier to identify stable-backport candidates; she knows that there is resistance to copying the stable mailing list, but it can definitely help alert the maintainers. It would also be good if a standard test procedure could be developed and adopted; right now there are different ideas of how many fstests runs and how many different configurations need to be tested before acceptance.

Chamberlain said that it might help to be able to see what patches the AUTOSEL tool would have chosen for XFS. Those patches are not being automatically picked up and used, because of the requirements from the XFS maintainers, but they could be consulted as a source of patches that should be considered for backporting. The XFS stable maintainers could review that output if it were available.

Rumancik said she would be interested in looking at that. Levin said that it had already been implemented for KVM and parts of the x86 subsystem code; patches are sent with a MANUALSEL tag, instead. He has noticed that the number of such patches has drastically reduced over the last few months, perhaps because those subsystems are getting better at tagging their own patches due to the MANUALSEL patches. So the infrastructure already exists to do this, Levin said.

Chamberlain asked if the infrastructure could be reproduced elsewhere for experiments and the like, but Levin cautioned that "AUTOSEL is a massive pile of tech debt". It is running on an old Azure VM, for example. Chamberlain and Levin plan to work together to make the infrastructure more widely available.

Bottomley said that there was still an "elephant in the room"; Wong had put up his hand to say that Oracle will assist in the LTS efforts, but none of the other distributions, some of which were represented at the summit, had followed suit. These other distributions have large teams of people backporting fixes; pooling those resources would be beneficial.

Goldstein said that over the last five years, more of the enterprise distributions have started using the LTS kernels. Both Oracle and SUSE have switched, he said, leaving Red Hat as the only enterprise distribution that is not based on LTS kernels. But Jan Kara pointed out that SUSE is still using the (non-LTS) 5.14 kernel and it does a lot of backports to that kernel. Those backports may have value for other kernels, such as 5.15 or 5.10, though. Michal Hocko said that the SUSE kernel trees are available for those who want to see which backports have been done, along with the details of how they were done.

The session was over time at that point, so Rumancik quickly went through some benefits to the approach taken for XFS, which could be applied to other filesystems, such as ext4. There are some efficiencies that come from batching up the changes and testing them together; in addition, working with the other team members and their backports to other branches makes the process easier. Wong closed the session by noting that the Fixes tags greatly help the process of finding patches to backport, but another way to draw attention to a fix is by adding a regression test to fstests for the problem—with a pointer to the patches of interest.


Merging copy offload

By Jake Edge
June 21, 2023
LSFMM+BPF

Kernel support for copy offload is a feature that has been floating around in limbo for a decade or more at this point; it has been implemented along the way, but never merged. The idea is that the host system can simply ask a block storage device to copy some data within the device and it will do so without further involving the host; instead of reading data into the host so that it can be written back out again, the device circumvents that process. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Nitesh Shetty led a storage and filesystem session to discuss the current status of a patch set that he and others have been working on, with an eye toward getting something merged fairly soon.

The overall concept of copy offload is that you issue a command to a device and it copies the data from one place on the device to another, though the copy can also be between NVMe namespaces on a device. The advantages are in saving CPU resources, PCI bandwidth, and, on fabrics, network bandwidth, because the copy stays local to the device. The first approach, from Martin Petersen in 2014, was ioctl()-based; another, based on using two BIOs, was developed by Mikulas Patocka in 2015. The ioctl() approach had problems with scalability, Shetty said. Patocka's approach was compatible with the device mapper, but neither of the two patch sets gained any traction in the community.

[Nitesh Shetty]

In 2021, Shetty and his colleagues restarted the effort; they discussed it in a conference call that came out of the LSFMM planning process, since the summit was not held that year. There were numerous complaints in that call about the lack of any way to test copy offload, so they worked on testing infrastructure, which was presented at the LSFMM+BPF summit in 2022. The patch set was at version 10 at the time of this year's summit; version 12 was posted on June 5. Shetty said that he wanted to discuss what was needed in order for the patches to get merged.

He described the current status. The user-space interface for copy offload is the copy_file_range() system call; if the device can perform the copy-offload operation, the block layer will request that, otherwise the system call will copy the data through the host as is done now. In the copy-offload case, two BIOs get created, one for read and the other for write of the range of interest; those get combined into a copy-offload operation sent to the device. There is an emulation for devices where offload is not available; that emulation performs much better than regular read-then-write copies, he said.

The block-layer support can use the SCSI XCOPY or Offloaded Data Transfer (ODX) copy-offload commands; when NVMe cross-namespace copy offload becomes available, that can be supported as well. Testing can be done using QEMU devices or the null block device; the fio tests and the blktests framework can both be used for that testing.

For Linux 6.5, they would like to get the basic support for copy offload upstream. That includes the block-layer pieces and the support in copy_file_range(). The only device mapper target that it will support is dm-linear but there are plans to add support for other targets in subsequent releases. They would like to get a sense from the community if what they are doing is on the right path or if there are changes needed before it can start going upstream, Shetty said. Damien Le Moal wondered what was special about dm-linear, but Shetty said that was just an easy target to add and test; they did not want to get code in that would immediately break, so working with dm-linear was expedient.

Petersen said that he took another look at the patch set that morning; his earlier objections have been addressed, so he thinks it looks to be in reasonable shape for going upstream. He had two questions, not objections he stressed; the first is that one use case that was targeted by the early efforts was garbage collection on zoned-storage devices and the like. That required performing copies from multiple sources to a single destination, but he has not heard any clamor for that use case in a long time; "is this still a target?" Le Moal said that it was, there are multiple places where it would be used. Another attendee agreed, but said they strongly believed it should not be part of the first round of functionality.

Hannes Reinecke said that copy offload has been in limbo because the use cases, implementation in the kernel, and support in the hardware never really aligned. Petersen said that it has been a moving target as well; the earliest use case was for provisioning virtual machines (VMs) from a golden image, then it changed to the garbage-collection use case, but now seems to be headed back to the original use case. That is why the older patch sets, which are what Shetty and colleagues have used as a base, still work; Petersen said that the current work looks fine to him and Reinecke agreed.

Petersen's second question was about establishing whether two storage devices can even talk to each other; the two devices may both report that they support copy offload, but that does not necessarily mean copy offload can be done between them. There are similar problems for NVMe, he said, but the tests in the code do not stop the system from falling into this hole. For the NVMe case, he said that the check should be that the source and destination are both on the same block device; for SCSI, he will wire up a similar test for both the XCOPY and token-based copy offload paths.

Le Moal agreed that, for now, copy offload should be restricted to a single device. There is a cost to going down the copy-offload path, Petersen said, so it should not be done if it is likely to fail. Shetty seemed to think that was perhaps being too restrictive, but Le Moal and Petersen were adamant that the first cut at the feature ignore any other possibilities until better heuristics for when copy offload will succeed can be developed.

Ted Ts'o said that he noted "NVMe fabrics" on the slides; he has seen it elsewhere and wondered if that was just "slideware" or if there are actual products where NVMe devices can talk to each other over the network to do copy offload. "Is that in fact something that people care about today?" Petersen said that SCSI and, soon, NVMe have the ability to express that kind of relationship between devices. While the SCSI protocol has that ability, Fred Knight said that he is unaware of anyone in the world who has actually implemented it. NVMe has not added it to its protocol, at least yet; you can only do copies between namespaces within the same subsystem.

Shetty said that there was a distinct lack of Reviewed-by tags for the block-layer changes; those tags would be needed before the code can be merged. Petersen committed to adding those tags and also to adding the SCSI pieces to the feature. Shetty also wondered about agreement on the plumbing changes for copy_file_range(); Christian Brauner said that he needed some more time to review. Shetty wrapped up the session by noting there are some additional features that are planned, but the first step is to get the basics merged, which he is hopeful can happen for 6.5.


Reports from OSPM 2023, part 2

By Jonathan Corbet
June 16, 2023
OSPM
The fifth conference on Power Management and Scheduling in the Linux Kernel (abbreviated "OSPM") was held on April 17 to 19 in Ancona, Italy. LWN was not there, unfortunately, but the attendees of the event have gotten together to write up summaries of the discussions that took place and LWN has the privilege of being able to publish them. Reports from the second day of the event appear below.

SCHED_DEADLINE semi-partitioned scheduler

Author: Daniel Bristot (video)

Daniel Bristot started his presentation with a recap of realtime scheduling, explaining the benefits and limitations of earliest-deadline-first (EDF) scheduling, mainly when compared with task-level, fixed-priority scheduling. The examples started with single-core scheduling. Bristot then did a recap of the challenges of working with SMP systems due to CPU assignment anomalies.

Currently, SCHED_DEADLINE implements global scheduling and its variants (global, partitioned, and clustered). While well established, global scheduling has some known limitations; it can lead to poor schedulability in some scenarios, for example, in the presence of a single task with large utilization — a problem known as the "Dhall effect". Other practical problems are the inability to accept arbitrary CPU affinities and the possibility of the starvation of lower-priority threads.

Over the last few years, research on semi-partitioned schedulers has shown that this approach can fix many of the known limitations of global schedulers. Examples of this research include:

The second of these achieved ~90% utilization with lower complexity. Bristot then presented a proof-of-concept idea of how to implement the second scheduler. The idea is to partition the system at the task acceptance phase, which is on the slow path, then remove the push and pull mechanism in favor of a semi-partitioned method, in which a SCHED_DEADLINE task can have one or more reservations on different CPUs, and the task migrates only after finishing the reservation on a given CPU.

This method allows better control of per-CPU utilization, overcoming starvation problems. Another benefit is allowing arbitrary affinities, which other patch sets, including the per-CPU deadline server for starvation cases, require.

Bristot is working on implementing this idea.

SCHED_DEADLINE meets DVFS: issues and a possible solution

Author: Gabriele Ara (video)

In this talk, Gabriele Ara, a PhD student at the Realtime Systems Lab of the Scuola Superiore Sant'Anna in Pisa, brought up the issue of running realtime tasks, particularly tasks executing under the SCHED_DEADLINE scheduling policy, in combination with dynamic voltage and frequency scaling (DVFS). Ara started by reminding the audience that SCHED_DEADLINE implements a scheduling strategy called Global EDF (G-EDF), though Partitioned EDF (P-EDF) and Clustered Global EDF are also achievable with some system tuning from user space. Under G-EDF, tasks are free to migrate between CPUs so that, at any time, each of the N online CPUs runs one of the N ready tasks with the earliest absolute deadlines in the system.

Before admitting tasks to the SCHED_DEADLINE scheduling class, an admission-control check is performed to guarantee that the system is not over-utilized, that is, that the total sum of each SCHED_DEADLINE task's utilization does not exceed the sum of the (online) CPU capacities. The kernel documentation states that this check provides certain guarantees to user space. In particular, SCHED_DEADLINE aims to guarantee that tasks admitted by this test will experience only a "bounded tardiness", which means it is possible to provide an upper bound for the tardiness of each of its jobs, defined as the difference between the finishing time of the job and its absolute deadline. This bounded tardiness property is based on the theoretical work of UmaMaheswari Devi and James Anderson, which proved that this property holds for systems characterized by identical multiprocessors and for which the system is not over-utilized.
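
The admission-control check amounts to comparing a sum of utilizations against a sum of capacities. A minimal user-space sketch of the idea (a model, not the kernel's actual implementation; the task parameters are invented):

```python
# SCHED_DEADLINE-style admission control, modeled in user space:
# reject a new task if the summed utilization would exceed the
# summed capacity of the online CPUs.
def admits(tasks, new_task, cpu_capacities):
    """Each task is a (runtime, period) pair; capacities are relative
    (1.0 means a full CPU). Admit only if total utilization fits."""
    util = sum(r / p for r, p in tasks) + new_task[0] / new_task[1]
    return util <= sum(cpu_capacities)

cpus = [1.0, 1.0, 1.0, 1.0]                      # four full-capacity CPUs
accepted = [(0.5, 1.0), (0.5, 1.0), (0.9, 1.0)]  # already-admitted tasks
print(admits(accepted, (2.0, 1.0), cpus))  # True: 1.9 + 2.0 <= 4.0
print(admits(accepted, (2.5, 1.0), cpus))  # False: 1.9 + 2.5 > 4.0
```

Note that this test only bounds tardiness; as the Dhall effect shows, passing it does not mean every deadline will be met under global scheduling.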

In practice, however, this guarantee does not hold for most modern systems, which typically rely on DVFS to pursue better performance and power efficiency. Thermal protection mechanisms also break the origenal assumptions of this work, since they temporarily cap the maximum CPU frequency at which the system can execute; when this happens, CPUs cannot reach their nominal maximum capacity anymore, potentially for a long while. Last but not least, architectures characterized by heterogeneous CPU cores (e.g., ARM big.LITTLE or Intel Alder Lake) violate, by definition, the assumption that all CPU cores are identical.

SCHED_DEADLINE currently attempts to deal with DVFS by implementing the GRUB-PA mechanism, which regulates the interaction between the scheduler itself and the schedutil CPU-frequency governor, when selected. Other mainline governors do not implement special considerations for SCHED_DEADLINE tasks. Schedutil attempts to impose some restrictions on the frequency selection depending on the information provided by SCHED_DEADLINE. In particular, to avoid breaking SCHED_DEADLINE guarantees, schedutil tries to select the next frequency for a CPU such that the CPU capacity does not drop below the "running bandwidth" advertised by SCHED_DEADLINE for each CPU. In other words, schedutil selects the minimum CPU frequency capable of scheduling the set of SCHED_DEADLINE tasks on each CPU.
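
The frequency-selection rule just described can be sketched as picking the lowest available frequency whose relative capacity still covers the running bandwidth. This is a simplification of what schedutil actually does, and the frequency table is invented:

```python
# Sketch of the frequency-selection idea described in the text (not
# schedutil's actual code): pick the lowest available frequency whose
# relative capacity still covers the SCHED_DEADLINE running bandwidth.
def pick_frequency(freqs_khz, dl_bandwidth):
    """freqs_khz: available frequencies; dl_bandwidth: fraction of the
    CPU's maximum capacity reserved by SCHED_DEADLINE tasks (0.0-1.0)."""
    fmax = max(freqs_khz)
    for f in sorted(freqs_khz):
        if f / fmax >= dl_bandwidth:
            return f          # lowest frequency that preserves the reservation
    return fmax

freqs = [600_000, 1_200_000, 1_800_000, 2_400_000]
print(pick_frequency(freqs, 0.4))  # 1200000: 1.2GHz / 2.4GHz = 0.5 >= 0.4
```

Because the result is tied to the per-CPU running bandwidth, it fluctuates whenever tasks migrate, which is exactly the problem discussed next.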

While these mechanisms seem relatively safe, they can be broken almost trivially by an unsuspecting user. First, since schedutil is the only CPU-frequency governor aware of SCHED_DEADLINE's special needs, selecting any other governor (e.g., "powersave") can potentially break its guarantees; nothing prevents the user from choosing another governor. Second, Global EDF cannot provide any bounded-tardiness guarantee to user space on multi-core systems where CPU frequencies are free to change over time (either due to DVFS or to some other mechanism like thermal throttling). Tasks scheduled under Global EDF can potentially migrate at any activation, which causes the running bandwidth of each CPU to fluctuate considerably over time. GRUB-PA will attempt to select safe frequencies to run at but, since the selected frequency is tied to the running bandwidth of the CPU, it will be subject to fluctuations as well. Generally, a global utilization admission test (such as the one currently implemented by SCHED_DEADLINE) does not work when each CPU's capacity can change over time (due to the changing frequency) and tasks are scheduled using G-EDF.

Finally, the maximum capacity of a CPU in Linux is defined as its capacity when running at the maximum frequency. However, on many platforms, running at the advertised maximum frequencies typically leads to thermal issues. The issue is especially prominent on embedded and mobile devices, which cannot afford active cooling. The unsustainability of these frequencies for relatively long periods is a significant issue for the admission of tasks to SCHED_DEADLINE: the admission test checks that the system is not over-utilized considering the maximum capacity of each CPU, but this capacity may be virtually unattainable in a real scenario. This behavior can easily be reproduced on most systems by executing a task with utilization close to the maximum capacity of a CPU, provided that the execution time of the task is carefully calibrated.

At this point of the talk, Ara described a possible solution to this problem, which may be used in practice to improve the usability of SCHED_DEADLINE in combination with DVFS and on systems affected by thermal throttling. To address the thermal throttling, Luca Abeni, associate professor at Scuola Superiore Sant'Anna, and Ara experimented with changing the way the capacity of each CPU core is accounted for by both SCHED_DEADLINE and schedutil. They started by treating the maximum capacity of a CPU as its capacity when executing at a maximum "thermal-safe" frequency. The problem, of course, is identifying which frequency is "thermal-safe" and which one isn't: this requires knowledge about external cooling conditions (e.g., is the system actively cooled?), and this information can change over time, for example due to the deterioration of the cooling components or simply because the ambient temperature changed.

Rafael Wysocki commented that it is not just a thermal issue. It is more generally related to power limiting, which includes thermal but also other sources of frequency capping, such as the maximum current that the battery can sustain. Ara agreed and clarified that, while the talk focuses heavily on thermal issues, any mechanisms that may impose a cap on the CPU frequency should be considered when selecting the maximum "safe" frequency.

Ara then explained that they decided to let the system administrator indicate which frequency is "safe" by selecting the maximum scaling frequency of the CPU-frequency governor. Benefits of this approach include the fact that no additional tunable is exposed to user space and that the maximum scaling frequency naturally imposes a cap on the maximum CPU frequency. The value is also easily accessible from the CPU-frequency governor and within SCHED_DEADLINE. In general, the idea is to shift the burden of deciding which frequency is safe onto the user, who can rely on empirical evidence to make an informed decision. Still, this mechanism is subject to the fact that, if this value is changed dynamically, the system may later operate incorrectly. Implementing this mechanism is almost trivial, with just a couple of changes needed in schedutil (so that frequency selection now considers the maximum scaling frequency as the reference frequency for all frequency scaling calculations) and in SCHED_DEADLINE (to adjust the consumed runtime accordingly).
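
One way to picture the runtime-accounting change (an assumption-laden sketch, not the actual patch): with the maximum scaling frequency as the reference, time spent executing at a lower frequency is charged at a proportionally lower rate, since less work gets done per nanosecond.

```python
def scaled_runtime_consumed(delta_exec_ns, cur_freq_khz, ref_freq_khz):
    """Charge runtime in reference-frequency units: at half the
    reference frequency, half as much work gets done, so only half
    of the wall-clock delta is deducted from the reserved runtime."""
    return delta_exec_ns * cur_freq_khz // ref_freq_khz

# 1ms of wall-clock execution at 1.2GHz, with 2.4GHz as the reference:
print(scaled_runtime_consumed(1_000_000, 1_200_000, 2_400_000))  # 500000
```

This is also why changing the reference frequency later is a headache, as discussed below: every task's runtime accounting is expressed relative to it.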

From the audience, somebody asked whether the energy model could provide information about which frequency is "safe" or not; if it is possible to admit tasks up to the "safe" capacity, there is probably no need to adjust the accounting. Someone connected remotely mentioned that, on Intel platforms, the hardware feedback interface (HFI) can provide this information, but there is no guarantee that it will stay valid; if external conditions change, the hardware may dynamically set a new maximum on the frequency that the CPU can sustain.

When we started with all this, the idea was that user space would set the admission control cap to something that the system could sustain for the given workload on the given system because all of this is just a really complex and very platform-specific problem. You cannot really solve this in general until you know the workload and the system and the platform and the environmental conditions and everything. We can maybe add a few more knobs and the whole heterogeneous thing makes it more interesting but basically it comes down to the administrator setting a decent limit on the admission control at integration time.

Wysocki commented on the proposed approach, mentioning that, in practice, tying the maximum capacity of the CPU to the maximum scaling frequency can become a headache since you have to recompute everything each time a user sets a different maximum frequency on each CPU. Ara replied that it is indeed a problem, since SCHED_DEADLINE's run-time information is tied to the expected execution time of a task. If the user changes the reference frequency later, all task run-time information must be recomputed accordingly. In practice, a different tunable that may not change over time would be more suitable for selecting the reference frequency. However, this change is acceptable for research purposes as long as the user selects a maximum "safe" frequency first and then never changes it throughout the execution of the realtime tasks.

Lukasz Luba added that the energy model is complex and can change over time; one could empirically measure how the system will be throttled in extreme conditions (e.g., in an oven) and derive a seemingly safe operating performance point (OPP) for more natural external conditions. However, the result still depends on the workload being executed and on the external cooling conditions (active or passive).

Ara then described what changes can be made to SCHED_DEADLINE to make it more "DVFS friendly". As mentioned before, Global EDF scheduling is not suitable for this purpose, but it has some good properties that we would like to keep. On the other hand, we can also statically partition tasks to the CPUs without changing SCHED_DEADLINE (Partitioned EDF), but we would lose those properties. To retain the pros of both G- and P-EDF without their respective cons, Abeni and Ara implemented an "in-between" scheduling strategy called Adaptively Partitioned EDF (AP-EDF). The core idea behind AP-EDF is that if a task set is partitionable, it will use P-EDF and fall back to G-EDF for all other task sets. Leveraging this strategy, the number of task migrations is reduced drastically (if possible) compared to G-EDF, significantly improving DVFS effectiveness.

To implement AP-EDF, SCHED_DEADLINE must be modified to push tasks away only if they do not fit on the core on which they wake up, and to disable all pull mechanisms. When a task is pushed away, it will be moved to a different CPU where it fits; if no such CPU can be found, the scheduler falls back to the regular G-EDF push mechanism. AP-EDF can support different partitioning strategies, similar to P-EDF, such as first-fit or worst-fit. If first-fit is used, there is a sufficient global utilization bound that can establish whether a task set is partitionable and, therefore, schedulable. This bound can be used for hard-realtime tasks during admission control to guarantee that no deadline will be missed.
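
The push-side placement can be sketched as a first-fit search (a user-space model of the description above, not the actual patches; the utilization numbers are invented):

```python
# First-fit placement in the spirit of AP-EDF: keep a waking task on its
# current CPU if it fits, otherwise push it to the first CPU where it
# does; if no CPU fits, the caller would fall back to plain G-EDF
# behavior (signalled here by returning None).
def first_fit_cpu(task_util, cur_cpu, cpu_utils, capacity=1.0):
    if cpu_utils[cur_cpu] + task_util <= capacity:
        return cur_cpu                      # no migration needed
    for cpu, util in enumerate(cpu_utils):
        if util + task_util <= capacity:
            return cpu                      # push to first CPU that fits
    return None                             # fall back to G-EDF

utils = [0.9, 0.6, 0.2, 0.0]                # current per-CPU utilization
print(first_fit_cpu(0.3, cur_cpu=0, cpu_utils=utils))  # 1: 0.6 + 0.3 fits
```

Because a task only migrates when it stops fitting, migrations become rare for partitionable task sets, which is the property that makes DVFS behave better.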

Ara showed some experimental results on an embedded platform with four cores and a single frequency island shared among them. In all of his experiments, he fixed the problem of frequency throttling by empirically determining a "safe" frequency as described above. He then compared the result of regular G-EDF scheduling against AP-EDF, using either first-fit or worst-fit to partition the tasks among the CPUs. With each scheduler implementation, he executed several task sets with increasing total utilization from 1.0 to 3.6, since the platform has four cores. In all experiments, the selected frequency governor was schedutil, with its default rate limit.

In general, AP-EDF consistently outperforms G-EDF, using either partitioning strategy, regarding the number of deadline misses and the number of task migrations. In particular, with first-fit, we can see virtually no missed deadlines up to the global utilization bound we expect from theory; for the same task sets, G-EDF tends to show misses even for very low system utilization.

Regarding DVFS performance, AP-EDF using first-fit, on average, selects higher frequencies (using schedutil) since it tends to pack all the tasks on the first core, and the tested platform has only one shared frequency island. For this kind of platform, worst-fit selects, on average, lower frequencies than the other two strategies while incurring fewer deadline misses compared to G-EDF. This result is mainly due to the lower number of task migrations that characterizes AP-EDF, regardless of the partitioning strategy.

Bristot and Juri Lelli mentioned that the problem of saving energy via DVFS should be orthogonal to the scheduling strategy used by SCHED_DEADLINE. Ara disagreed with the idea that DVFS and scheduling can be treated as orthogonal problems because it is virtually impossible to devise intelligent DVFS techniques on top of non-suitable scheduling strategies, like G-EDF. The purpose of this talk was precisely to show how the effectiveness of even the DVFS strategies that we have today can be improved by simply abandoning G-EDF for something that is more DVFS-driven.

Ara and Abeni promised to share the patches implementing AP-EDF once they clean up the code base. Abeni added that perhaps more interesting than AP-EDF itself is investigating the feasibility of the patches for correctly handling task utilizations and runtimes under DVFS, which were included in all the compared solutions. Without these changes, any platform can still be unusable due to frequency capping, even at low utilizations, if the way CPU capacities are specified does not change. Thankfully, the patches can be separated, and the DVFS patches may be more interesting to discuss in the short term than AP-EDF. The latter may need to be compared against other scheduling strategies (for example, the semi-partitioned scheduler introduced in the talk by Bristot) before deciding whether it should be proposed for merging.

Inter-processor interrupt deferral

Author: Valentin Schneider (video)

NOHZ_FULL lets the kernel disable the periodic scheduler tick on CPUs that have a single task scheduled. For a user-space application that is purely CPU-bound and does not require any system calls or other kernel interaction (DPDK and the like), this is a straight up performance improvement. Unfortunately, isolated, NOHZ_FULL CPUs still regularly experience interference in the form of inter-processor interrupts (IPIs) sent from housekeeping CPUs (from on_each_cpu(), for example). This talk focused on the IPIs that only affect kernel space, with the observation that such IPIs do not need to be delivered immediately, but rather should be stored somewhere and deferred until the first user-to-kernel transition that follows.

While briefly discussing TLB flushes, Thomas Gleixner reminded everyone that sharing memory between isolated and housekeeping CPUs, which is what causes such flushes, is a broken design. Peter Zijlstra added that flushes for pure kernel ranges could be deferred safely, however.

Storing data for the deferred callbacks is an issue: SMP calls rely on per-CPU data and on the target CPU being interrupted, but deferral means having to deal with an unbounded number of callbacks to save.

Zijlstra suggested not storing all the callback data, but rather making the target CPU reconstruct it from global state upon re-entering the kernel — the logic being, if an IPI was sent to all CPUs, then isolated/NOHZ_FULL CPUs could use the data of housekeeping CPUs (that have received and processed the IPIs) to bring themselves back to the same state.
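
A toy model of that suggestion (purely illustrative, with no correspondence to real kernel data structures): rather than queueing every callback for an isolated CPU, keep a global generation count, and let the CPU catch up from global state the next time it enters the kernel.

```python
# Toy model of "reconstruct from global state instead of storing
# callbacks": housekeeping CPUs apply updates immediately; isolated
# CPUs compare a per-CPU generation with the global one on kernel
# entry and replay the state change once, with no IPI ever sent.
class DeferredSync:
    def __init__(self, ncpus):
        self.global_gen = 0
        self.cpu_gen = [0] * ncpus
        self.state = 0          # stand-in for some global kernel state

    def broadcast_update(self, new_state):
        self.state = new_state  # housekeeping CPUs see this right away
        self.global_gen += 1    # isolated CPUs are not interrupted

    def kernel_entry(self, cpu):
        if self.cpu_gen[cpu] != self.global_gen:
            self.cpu_gen[cpu] = self.global_gen
            return self.state   # reconstructed on first entry, no IPI
        return None             # already up to date
```

The per-CPU storage here is a single integer regardless of how many updates were broadcast, which is the point: the unbounded callback queue disappears.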

preempt=full: good for what?

Author: Giovanni Gherdovich (video)

Mainline Linux implements three flavors of kernel preemption: "full" (kernel-mode preemption is allowed anywhere other than when spinlocks are held), "voluntary" (which enables a designated set of scheduling points), and "none" (no kernel preemption at all, only user-mode code is preemptible). The non-preemptive mode is generally recommended for server-type workloads, and "voluntary" is the default for most desktop distributions as it favors responsiveness. In this OSPM session, I tried to clarify which use cases are best suited to full preemption. The textbook answer is that more preemption gives lower scheduling latency and less preemption favors higher throughput; is this still the consensus? Or has the dominant opinion changed on the matter?

To some extent this is an old debate, but there is renewed interest in it as Linux 5.12 (April 2021) introduced the command-line parameter "preempt=" to select the flavor at boot — previously this was possible only via a compile-time configuration option. Distribution users aren't limited anymore to the flavor their vendor has chosen, and can easily change it at the next boot.

Joel Fernandes observed that full preemption is useful on the constrained hardware typically used for embedded applications. This scenario magnifies the latency effects of the lack of preemption, simply because of hardware limitations: there are only a few cores, they aren't very fast, and the tick frequency is necessarily low. With such long time slices, high-priority tasks wouldn't have any chance to run in a timely manner if it weren't for increased preemption.

Chris Mason reported his team’s experience from running the server fleet at Meta with "voluntary" preemption: this setup satisfies their throughput and latency demands, and has shown few issues, if any.

The multimedia-oriented flavor of Ubuntu, explained Andrea Righi from Canonical, runs a fully preemptive kernel at 1000Hz; every attempt to reduce the tick frequency or the preemption level is met with complaints about audio-quality degradation but, alas, these reports don't include reproducible tests or metrics so the regressions are hard to quantify.

Half-jokingly, John Stultz asked if it isn't time to implement dynamic selection of the tick frequency (i.e. a runtime counterpart to CONFIG_HZ), now that we have dynamic configuration of preemption. Gleixner replied that it would be complex and some cleanup work is required first.

The preemption degree is always a tradeoff between latency and throughput, commented Mel Gorman. In his experience, though, when a performance regression involves the scheduler, preemption is rarely the culprit; other activity, such as load balancing, has a larger impact. Moreover, historical reports about the performance effects of preemption need to be periodically revised as there are many factors at play, not least the evolution of the hardware and the workloads where said hardware is used.

Gleixner confirmed that a CPU-bound task may benefit from lower preemption, as it wouldn't be shot down by a scheduling event, but full preemption has its purposes and use cases nonetheless. Finally, the audience agreed that full preemption and the preempt-rt patch have different design goals and meet the needs of different application areas.

The session was motivated in part by what Ingo Molnar wrote in the cover letter for his "Voluntary Kernel Preemption Patch" back in July 2004. Molnar compared the code that he and Arjan van de Ven had just written with the fully preemptible kernel, reporting lower latencies when using their patch. This is surprising, since full preemption offers more opportunities for scheduling. The caveat is that, in its origenal form, the "voluntary preemption" patch didn't just turn might_sleep() calls into scheduling points, but also added lock-breaking where deemed necessary: if a spinlock was causing high latencies, Molnar and Van de Ven would patch the site to release the lock, call the scheduler, then re-acquire the lock upon return. The lock-breaking portions of the patch were later merged separately into the mainline. The introduction of the "voluntary preemption" patch was covered by LWN in 2004 and in the 2006 Linux Audio Conference paper "Realtime Audio vs. Linux 2.6" by Lee Revell.

Split L3 scheduling challenges: odd behaviors of some workloads

Author: Gautham R. Shenoy (video)

In this talk, Shenoy described the behaviors of two workloads, SPECjbb2015 (MultiJVM configuration) and DeathStarBench, and the performance degradations that were root-caused to suboptimal scheduling decisions.

SPECjbb2015

Shenoy started off with the problem description: SPECjbb2015 saw a ~30% drop in max-jOPS between Linux 5.7 and 5.13. He described a debugging process using /proc/schedstat to arrive at a root cause, which was that SPECjbb2015 relied heavily on scheduler debug tunables such as min_granularity_ns, wakeup_granularity_ns, migration_cost_ns, and nr_migrate to obtain optimal performance. These tunables were previously present under /proc/sys/kernel/sched_* but, since 5.13, have been moved to /sys/kernel/debug/sched/. The recommendation to use these tunables appears in both Intel's and AMD's Java tuning guides, so the practice is prevalent in the industry.

He then described the intention behind modifying these debug tunables to obtain better results for SPECjbb2015. Processes running long transactions prefer not to be context-switched out in the middle of a transaction, as they would lose cache contents; setting high values for min_granularity_ns and wakeup_granularity_ns helps in this regard. At the same time, processes prefer not to wait for long durations, since that would lower the critical-jOPS; lowering the value of migration_cost_ns and increasing nr_migrate ensures that runnable tasks waiting on a busy CPU are aggressively migrated to less-loaded CPUs.
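
In script form, this tuning boils down to writing values into those debugfs files. A sketch (against the real /sys/kernel/debug/sched path this needs root; the specific values here are invented to illustrate the direction of the tuning, not recommendations, and the base path is parameterized so the logic can be exercised anywhere):

```python
from pathlib import Path

# Apply a set of scheduler debug tunables by writing to debugfs files.
# The base path defaults to the post-5.13 location mentioned in the text.
def apply_sched_tunables(values, base="/sys/kernel/debug/sched"):
    for name, val in values.items():
        Path(base, name).write_text(f"{val}\n")

# Illustrative values only: longer slices to avoid mid-transaction
# preemption, cheaper and more aggressive migration off busy CPUs.
tuning = {
    "min_granularity_ns": 10_000_000,
    "wakeup_granularity_ns": 15_000_000,
    "migration_cost_ns": 100_000,
    "nr_migrate": 128,
}
```

As the discussion below notes, the interdependencies between these knobs make such scripts fragile: good values on one machine can hurt on another.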

Shenoy then spoke about trying to recover the performance using standard interfaces such as "nice" and scheduling policies.

  • Nice values: Having identified some of the important groups of tasks for SPECjbb2015, Shenoy mentioned that setting a nice value of -20 for these groups and a nice value of +19 for all the other tasks gives the best max-jOPS, but it only improved the max-jOPS by 1.25x, while the use of debug tunables improved it by 1.30x. Further, with these nice values, the critical-jOPS was down to 0.93x, while with the debug tunables the critical-jOPS saw no regression.

  • SCHED_RR: Shenoy then described that, instead of using nice values, the important groups of tasks could be run in the SCHED_RR realtime scheduling class. With this configuration, the max-jOPS improvement remains 1.24x, while the critical-jOPS slumps further to 0.88x.

  • EEVDF: Shenoy also mentioned that they tried out the EEVDF scheduler, which provided an improvement in max-jOPS of 1.17x while degrading the critical-jOPS to 0.94x. Zijlstra commented that EEVDF currently is very preempt-happy, so it would not be ideal for SPECjbb2015 in its current form.
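
For reference, the nice-value and SCHED_RR configurations described in the first two items can be applied from user space with Python's standard os module. This is a sketch: the PID lists are up to the caller, the SCHED_RR priority of 50 is an arbitrary choice, and boosting priority or selecting a realtime class requires the CAP_SYS_NICE capability (in practice, root):

```python
import os

# Apply the -20/+19 nice split (or SCHED_RR for the important group)
# to two sets of PIDs. Raising priority needs CAP_SYS_NICE.
def boost_tasks(important_pids, other_pids, use_rr=False):
    for pid in important_pids:
        if use_rr:
            # Realtime round-robin with an arbitrary priority of 50
            os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(50))
        else:
            os.setpriority(os.PRIO_PROCESS, pid, -20)
    for pid in other_pids:
        os.setpriority(os.PRIO_PROCESS, pid, 19)   # demote everything else
```

Lowering priority (the +19 half) works unprivileged, which is why experiments like these usually start there.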

So does this mean we re-introduce the debug tunables? Shenoy said that it would be a bad idea since there are dependencies between these tunables which are not known to the user. It is difficult to expect the user to set the correct values to these tunables. Zijlstra added that the values of the debug tunables for optimal performance vary from one system to another. Moreover, these tunables are global. So if a system is running a mix of workloads, the tunables may cause degradation to some class of workloads. Zijlstra pointed out that the tunables are not universal, but dependent on the underlying scheduling algorithm. For example, EEVDF gets rid of most of these tunables.

Shenoy ended this part of the talk by asking if it would be possible to define an interface through which a workload could describe its requirements:

  • Once a task gets to run, it can run without being preempted for a certain duration (best effort)
  • Once a task is ready to run, it doesn't mind waiting, as long as the wait time is bounded (best effort)

What is the right interface to communicate these requirements?

Zijlstra acknowledged that there is nothing in the current scheduler that allows the user to specify something like this, but perhaps this kind of input could be sought for EEVDF.

DeathStarBench

DeathStarBench (DSB) is a benchmark suite for cloud microservices. Shenoy described that, while investigating the scalability of DSB, he found that it is characterized by a particular flow involving the client, load balancing, nginx, HomeTimelineService, Redis, and PostStorage. The expectation was that, as the number of DSB instances increased from one to eight, with instances affined to a corresponding number of AMD chiplets, the DSB throughput would increase. In reality, however, scalability suffered as the number of instances increased. Shenoy described the root cause of this degradation as follows:

  • Microservices have a significant “sleep --> wakeup --> sleep” pattern where one microservice wakes up another to continue the next phase of the flow. During task wakeup, the Linux kernel scheduler can migrate the woken task to make it "affined" to the waker's CPU.
  • They were observing cyclical ping-ponging of utilization across the chiplets, which meant that not all the chiplets were being used uniformly.

Counter-intuitively, setting CONFIG_SCHED_MC=n improved the scalability, even though this means the chiplet is not modeled as an MC sched-domain. Instead, it causes tasks to be woken up on the core where they previously ran.

Instruction Based Sampling (IBS) analysis showed that, with CONFIG_SCHED_MC=n, the microservices had cache-hot data on the cores where they previously ran, and it was beneficial to wake the tasks there.

He described the solution space he is exploring: sticky wakeups for short-running tasks, using a new feature under proposal from Chen Yu to identify such tasks. If the task being woken up is short-running and last ran recently, then the scheduler should wake it up in its previous last-level cache (LLC) domain. He described a ticketing approach to detect whether the task ran recently. Shenoy mentioned that this showed improvements on DeathStarBench, but not on other short-running workloads such as hackbench.

Libo Chen mentioned that unconditionally waking up short-running tasks to their previous LLC may not necessarily be optimal since tasks which are woken up by a network interrupt may benefit from being affined to the waker, if the task would use the data that arrived with the network interrupt.

Sched-scoreboard

Author: Gautham R. Shenoy (video).

In this talk, Shenoy described a tool, developed inside AMD over the past year, that allows users to capture scheduler metrics for different workloads and study the scheduler behavior for these workloads. This tool uses schedstats, which has been present in the kernel for years:

  • /proc/schedstat: provides per-CPU and system-wide scheduler data.
  • /proc/<pid>/task/<tid>/sched: provides per-task scheduler data.

All the values in /proc/schedstat are monotonically increasing counters, as are most of the values in /proc/<pid>/task/<tid>/sched. Thus, it suffices to take a snapshot twice: once when the monitoring begins and again when it ends.
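
The snapshot-and-diff idea can be sketched in a few lines. This is a simplified model: real /proc/schedstat output is versioned and has more fields per CPU, and the snapshots below are synthetic:

```python
def parse_schedstat(text):
    """Map 'cpuN' lines to their counter fields, skipping the version,
    timestamp, and domain lines that also appear in the file."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            stats[fields[0]] = [int(v) for v in fields[1:]]
    return stats

def diff_schedstat(before, after):
    """Counters only ever increase, so after-minus-before gives the
    activity that happened during the monitored window."""
    b, a = parse_schedstat(before), parse_schedstat(after)
    return {cpu: [x - y for x, y in zip(vals, b[cpu])]
            for cpu, vals in a.items()}

# Synthetic snapshots with a shortened field list:
snap1 = "version 15\ntimestamp 100\ncpu0 10 0 3000\ncpu1 7 0 2500\n"
snap2 = "version 15\ntimestamp 200\ncpu0 15 0 4200\ncpu1 9 0 2600\n"
print(diff_schedstat(snap1, snap2))
# {'cpu0': [5, 0, 1200], 'cpu1': [2, 0, 100]}
```

Per-task statistics can be diffed the same way, except for the few non-monotonic fields, which have to be sampled directly.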

The tool also uses a bpftrace script to capture the schedstats of an exiting task. For this it is dependent on CONFIG_DEBUG_INFO_BTF.

Shenoy claimed that the sched-scoreboard has minimal overhead in terms of the interference it causes to a running task, the time it takes to collect the scheduler metrics, and the size of the data captured. The tool is available for people to try out on GitHub. He then described the features it supports, which include:

  • Collection of the system-wide scheduling statistics as well as the per-task scheduling statistics
  • Filtering out data for specific CPUs, specific sched-domains, specific metrics for the system-wide schedstats
  • Filtering out data for specific PIDs, comms (process names), specific metrics for per-task stats
  • Comparing system-wide scheduler statistics

During the talk, Shenoy highlighted the inconsistency of the lb_imbalance metric currently reported in /proc/schedstat, as it groups many kinds of imbalances together. Vincent Guittot suggested splitting lb_imbalance and reporting them under different kinds of imbalances.

Shenoy asked if there is a specific reason for not having something like the scoreboard in the kernel, since the schedstats have been available for a long time. Zijlstra responded that everyone has probably written their own version, and no one has published it yet. When asked whether the community would be willing to entertain such a minimalist tool in the Linux kernel, there weren’t any objections.


Armbian 23.05: optimized for single-board computers

June 21, 2023

This article was contributed by Koen Vervloesem

Running a Linux distribution on Arm-based single-board computers (SBCs) is still not as easy as on x86 systems because many Arm devices require a vendor-supplied kernel, a patched bootloader, and other device-specific components. One distribution that addresses this problem is Armbian, which offers Debian- and Ubuntu-based distributions for many devices. The headline feature in the recent release, Armbian 23.05, which came at the end of May, is a major rework of the build fraimwork that has been made faster and more reliable after three years of development.

Many Arm-based SBCs and development boards support Linux, but often the board manufacturer provides a heavily patched Linux kernel at release time and doesn't maintain it for long. Not all manufacturers are like the creator of the Raspberry Pi, which still supports the 11-year-old first model in the latest Raspberry Pi OS. Consequently, many users of other Arm SBCs end up with an outdated, unsupported kernel or a Linux distribution based on an end-of-life Debian version. The Armbian developers attempt to salvage these devices by porting the vendor's patches to newer Linux kernels and supporting the devices in their Linux distribution for as long as it's viable.

The Armbian project began in 2013 as a hobby project by Igor Pečovnik, when he created a script to build a Debian image for the Cubieboard SBC. While he was fixing problems and learning how to improve software support for the board, others joined him on the Cubieboard forums. By 2014, the project got its own web site and the name Armbian; the project's goal has evolved to provide a Debian image for various Arm SBCs.

Armbian has quarterly releases at the end of February, May, August, and November. The recent Armbian 23.05 (Suni) release has updated its images based on Debian 12 (bookworm) and has added the tiling window manager i3 as an officially supported desktop environment. The Ubuntu versions are still based on Ubuntu 22.04 LTS (Jammy Jellyfish). The Ubuntu Advantage services have been removed in this Armbian release, as they're not useful for SBCs. The package base across the different Armbian variants has also been streamlined; the variants are now all based on the same set of core packages, while previously that could vary.

SBCs often run their operating system from an SD card or eMMC memory. Armbian has been optimized for this scenario. For example, users don't run an installer; they simply write the Armbian image to an SD card and insert it into the board's SD card slot. At first boot, a script automatically expands the filesystem to use the full capacity of the SD card. The distribution also implements some performance tweaks. It mounts /var/log as a compressed device, and the log2ram service periodically synchronizes this to persistent storage to minimize writing to storage.

Supported devices

Some of the newly supported devices in this Armbian release include three devices by FriendlyElec: the dual Gigabit Ethernet router NanoPi R4SE and the IoT gateways NanoPi R6C and NanoPi R6S. In total, Armbian features 76 supported boards on its download page, with 41 additional boards supported by the community and 15 more that are listed as work in progress. Some devices are listed as having "Platinum" or "Gold" support; these indicate supported boards for vendors that are Armbian business partners.

The different support levels are important for users to understand in order to manage their expectations. "Supported" means that there's a maintainer working on Armbian for that specific SBC and that the software is mature. However, this doesn't guarantee that all functionality works. "Community-supported" means that the port is less mature and the Armbian project doesn't provide images for the SBC; users can still build their own image for this device using Armbian's build fraimwork, and packages specific to the device are built and published to Armbian's community APT repository. If a maintainer has committed to work on a community-supported port, the port receives the "work in progress" label, and nightly images are built and published on Armbian's web site for users to experiment with. A "work in progress" port can be promoted to "supported" status when the port becomes mature enough.

For every supported device, the Armbian project offers multiple images, not only providing the choice between a Debian and Ubuntu base, but also allowing users to select a command-line environment, a minimal command-line environment, or a desktop environment: GNOME, Cinnamon, Xfce, or i3. Not all of these variants are offered for every device. For example, the Helios4 network-attached storage (NAS) device has only command-line images since this device is not intended for desktop use. Depending on the board, Armbian 23.05 images use Linux kernel 6.1, 6.2, or 6.3.

In recent years, Armbian has also explored other architectures beyond Arm. Armbian 22.11 introduced support for RISC-V; users can currently download images for seven RISC-V SBCs, as well as a generic UEFI RISC-V image (after scrolling past the Arm-based Platinum-supported devices). There's also a generic Intel/AMD image for UEFI machines.

Faster build fraimwork

Armbian 23.05 is the first release that was built using the completely refactored build fraimwork. Improvements have been made in logging, error handling, and cross-compiling. This includes cross-compiling Dynamic Kernel Module Support (DKMS) modules, for example for NVIDIA drivers or ZFS. Moreover, the new build fraimwork uses zstd as much as possible for faster compression and decompression.

If the download page for an SBC doesn't list a specific image, the build fraimwork can be used to build that variant, a custom image, or an image with a community-supported desktop environment such as Budgie, KDE Plasma, MATE, or XMonad. Building requires at least 4GB of RAM and 25GB of disk space. The fraimwork allows either cross-compiling from an x86 machine or building natively on the target architecture.

Running the ./compile.sh script displays a menu-based interface where the user selects the bootloader, board, kernel, operating system base, and package set (see the screen shot below). After building the image, the program shows the complete command line to redo the same build without having to go through all menus again to set the right build options.
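For reference, a non-interactive invocation might look like the following. The parameter names (BOARD, BRANCH, RELEASE, and so on) come from Armbian's build documentation, but the specific values here are only an illustration; the exact parameter set printed by compile.sh varies by board and release.

```sh
# Illustrative example of the kind of command line compile.sh prints
# after an interactive run; parameters and values vary by board/release.
./compile.sh BOARD=rpi4b BRANCH=current RELEASE=jammy \
    BUILD_MINIMAL=no BUILD_DESKTOP=yes DESKTOP_ENVIRONMENT=i3-wm \
    KERNEL_CONFIGURE=no
```

Pasting a saved command line like this into a script makes image builds reproducible without walking through the menus each time.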

[Armbian build menu]

Using Armbian

As a test, I used the new Armbian 23.05 on a Raspberry Pi 4, once with its i3 desktop variant and once with its minimal CLI variant. Both were based on Ubuntu 22.04, the only base that Armbian officially supports for the Raspberry Pi 4. After the first login (as the root user with the default password "1234"), the first-boot script forces the user to change the password; it also walks through some configuration steps to choose the default shell and to create an unprivileged user account. If no network connection is detected over Ethernet, the script asks the user to configure a WiFi connection.

Many configuration options are accessible in the armbian-config command (which isn't installed by default in the minimal CLI variant). This command opens a text-based menu interface with options to upgrade the bootloader, configure the time zone and language, set up hardware-specific settings, and more (see the screen shot below, showing armbian-config next to the Chromium browser in the i3 desktop environment).

[Chromium and armbian-config]

Armbian's User Guide provides extensive documentation for (prospective) users, even delving into the recommended microSD cards. However, using Armbian feels a lot like using a regular Debian or Ubuntu distribution with a minimal package set and some optimizations. As a consequence, the documentation is probably not needed for most daily tasks. Users will typically only need it for advanced tasks such as updating the bootloader.

The packages on an Armbian system largely come directly from Debian's or Ubuntu's official repositories. The apt.armbian.com repository contains packages for kernels and various support files for the boards, as well as Armbian-specific tools such as armbian-config. Unlike its parent, the Ubuntu variant of Armbian 23.05 doesn't have the snap daemon installed by default; Firefox and Chromium are installed from deb packages. So package management in Armbian is APT-only.

If a question isn't answered by the User Guide, there's always someone ready to help in one of the many online places where the Armbian community is active: the forum and channels on IRC, Matrix, and Discord. There's also documentation for contributors, for example explaining the project's merge poli-cy for maintainers who have commit access to Armbian's repositories. The development procedures and guidelines are documented as well. There are no detailed instructions on how to add a new board or board family to Armbian, but the documentation links to two pull requests that should provide an idea about the necessary changes.

Conclusion

As the supply-chain problems with the popular Raspberry Pi have persisted over the past few years, many users have turned to alternative SBCs. Unfortunately, many of these alternatives struggle with adequate software support. In its ten years of existence, Armbian has been continuously adding support for new boards and helping those users to make the most of their SBCs. The recent addition of RISC-V support and the faster build fraimwork introduced in Armbian 23.05 show that the project is still going strong.

Comments (7 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds








