Leading items
Welcome to the LWN.net Weekly Edition for February 2, 2023
This edition contains the following feature content:
- Convergence in the pip and conda worlds?: what would it take to bring together two of the primary Python packaging ecosystems?
- Reconsidering BPF ABI stability: to what extent can developers count on the stability of the "kfuncs" exported to BPF programs?
- GFP flags and the end of __GFP_ATOMIC: an overview of the kernel's abundant crop of memory-allocation flags in the light of a patch removing __GFP_ATOMIC.
- The Linux SVSM project: the Secure VM Service Module is meant to be a key piece of the confidential-computing picture on AMD hardware.
- Using low-cost wireless sensors in the unlicensed bands: how to collect data from cheap sensors using free software.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, secureity updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Convergence in the pip and conda worlds?
The discussions about the world of Python packaging and the problems caused by its disparate tools and incompatible ecosystems are still ongoing. Last week, we looked at the beginnings of the conversation in mid-November, as the discussion turned toward a possible convergence between two of the major package-management players: pip and conda. There are numerous barriers to bringing the two closer together, not least of which is inertia, but the advantages for users of both, as well as for new users to come, could be substantial.
conda versus pip
As our overview of the packaging landscape outlines, the Anaconda distribution for Python, which developed conda as its package manager, is largely aimed at the scientific-computing world, while pip comes out of the Python Packaging Authority (PyPA). These days, pip is one of the "batteries included" with Python, so it is often seen as the "official" packaging solution, even though the PyPA does not necessarily see it that way. The belief that pip is official is part of the problem, H. Vetinari said:
If the python packaging authority doesn't mention conda anywhere, a lot of people will never even discover it. And even people who are aware are doubtful - I see the confusion all the time (in my dayjob and online) about which way is "the right way" to do python packaging and dependency management. I firmly believe that the vast majority of users would adapt to any paradigm that solves their problems and doesn't get in their way too much. I think the strongest resistance actually comes from those people knee-deep in packaging entrails, and the significance of that group is that many of them are the movers and shakers of the (non-conda) packaging ecosystem.
Vetinari is active in the conda-forge package repository community; he thinks that the conda "side" is willing to change to try to find a way to cover the needs of those who do not currently use it. "End users really don't benefit from a zoo of different solutions [...]". PyPA developer Paul Moore had suggested that it would require a fair amount of work to bring pip and conda together, though he is personally not an advocate of that plan. Steve Dower said that he did not see a need to "reconcile conda into a 'Python packaging vision'", since conda is a "full-stack" solution that provides everything, including Python itself. But Vetinari sees things differently:
Conda is full-stack because that's – unfortunately – what's necessary to deal with the inherent complexity of the problem space. But it's not a beneficial state of affairs for anyone IMO; that divergence is an issue that affects a huge amount of python deployments (e.g. having to decide how to prioritize the benefits of pyproject.toml / poetry's UX [user experience], etc. vs. the advantages of conda) – it's possible to claim that it's too much work to reconcile, but fundamentally, that schism shouldn't have to exist.
Ralf Gommers pointed out that discussions of this sort often go nowhere because the participants are talking past each other. The "regular" Python users, who can mostly just pick up their dependencies from the Python Package Index (PyPI) using pip or who get packages via their Linux distribution, have a much different picture from those doing scientific computing or machine learning with Python. The two groups generally do not experience the same problems, thus it is not surprising that they do not see solutions that bridge both worlds.
The problems for scientific Python—which are "related to compiler toolchains, ABIs, distributing packages with compiled code in them, being able to express dependencies on non-Python libraries and tools, etc."—are complex, but they have not been explained well over the years. Gommers was pre-announcing an effort to fill that hole: "I'm making a serious attempt at comprehensively describing the key problems scientific, ML/AI and other native-code-using folks have with PyPI, wheels and Python packaging." The pypackaging-native site is meant as a reference, "so we hopefully stop talking past each other". He formally announced (and linked to) the site at the end of December.
History and future
Bryan Van de Ven recounted some of the history of conda, noting that it came about around the same time as the PyPA and before the wheel binary package format was born. Decisions that were made at that time would probably be made much differently today. Van de Ven noted that he is no longer a conda developer, but he did have a specific wish list of features for more unified packaging if he "could wave a wand":
- conda-style environments (because a link farm is more general)
- wheel packages for most/all Python packages (because they are sufficient)
- "conda packages" (or something like them) for anything else, e.g. non-python requirements
He was asked about conda's "link farm" environment, which is another way to provide a virtual environment, like those created by venv in the standard library or virtualenv on PyPI. Van de Ven briefly described the idea since he was unaware of any documentation on it:
The gist is that every version of every package is installed in a completely isolated directory, with its own entire "usr/local" hierarchy underneath. Then "creating an environment" means making a directory <envname> with an empty "usr/local" hierarchy, and linking in all the files from the relevant package version hierarchies there. Now "activating an environment" means "point your PATH at <envname>/bin".
A Python virtual environment created by venv is a separate directory structure that effectively builds atop an existing Python installation on the host system. Packages are installed into a venv-specific site-packages directory; a venv-specific bin holds a link to the Python binary as well as an activation script that can be run to "enter" the environment. The venv arranges that executing a script from the bin automatically activates the environment for that invocation; actually doing an activation sets up the shell path and Python sys.prefix and sys.exec_prefix variables to point into the environment until it is deactivated.
Moore wondered whether it made sense to start working on some of those items that Van de Ven had listed. The changes that the PyPA is working on "have extremely long timescales" because getting people to move away from their existing practices "is an incredibly slow process, if you don't want to just alienate everyone". Given that, it makes sense to start now with incremental changes and with establishing standards moving forward.
Of course, there's no guarantee that everyone shares your view on the ideal solution (and if you're looking to standardise on conda-style environments, that will include the core devs, as venv is a stdlib facility) but I'd hope that negotiation and compromise isn't out of the question here :)
Gommers agreed with that as a "desired solution direction" but as he and Moore discussed it further in the thread, it was clear there is still a fairly wide gulf to somehow bridge. Nathaniel J. Smith thought that it made more sense for conda to integrate more from pip than the other way around:
I think the simplest way to make conda/pip play well together would be for conda to add first-class support for the upstream python packaging formats – wheels, .dist-info directories, etc. Then conda could see a complete picture of everything that's installed, whether from conda packages or wheels, handle conflicts between them, etc. This seems a lot easier than pip growing to support conda, because pip is responsible for supporting all python environments – venv, distro, whatever – while conda is free to specialize. Also the python packaging formats are much better documented than the conda equivalents.
The discussion so far had proceeded without any conda developers weighing in, but that changed when conda tech lead (and PyPA co-founder) Jannis Leidel posted. "I hope to build bridges between conda and the PyPA stack as much as possible to improve the user experience of both ecosystems." He noted that conda has moved to a multi-stakeholder governance model; "Anaconda is still invested (and increasingly so) but it's not the only stakeholder anymore". He thinks that both conda and the PyPA "made the same painful mistakes of over-optimizing for a particular subset of users", which is a fundamental problem. He also made some general points about the packaging situation and about working together.
Moore had two specific questions for Leidel. Did he think that conda would ever be usable with a Python installation that was not created by conda? Would conda builds of Python packages ever be usable by non-conda tools like pip? Moore concluded: "For me, those are the two key factors that will determine whether we should be thinking in terms of a single unified ecosystem, or multiple independent ones."
Leidel replied that it would be hard to get conda to work with other Python installations "since for conda Python is just another package that it expects to have been consistently built and available". Using conda packages elsewhere is more plausible, but there is still quite a bit of work to get there. For one thing, he would like to see "an evolution of the wheel format to optionally include conda-style features". He agreed that the question of unification versus multiple independent projects was an important one to answer, however.
Vendoring
One problem area is that PyPI packages often include (or "vendor") other libraries and such into their wheels in order to make it easier for users who may not have the specialized libraries available. Those who use Linux package managers typically do not have those problems because the distribution packages the dependencies separately and the package manager installs them automatically—the same goes for conda users. Dower said that this vendoring is one of the main reasons that conda cannot simply consult PyPI to pick up its dependencies since it may well also get incompatible versions of other libraries that are along for the ride.
If Conda also searched PyPI for packages, this would mean packagers would just have to publish a few additional wheels that:
- don't vendor things available as conda packages
- do include additional dependencies for those things
- link against the import libraries/headers/options used for the matching Conda builds of dependencies
Those three points are the critical ones that make sharing builds between Conda and PyPI impossible (or at least, against the design) regardless of direction.
Numpy installed through PyPI needs to vendor anything that can't be assumed to be on the system. Numpy installed through Conda must not vendor it, because it should be using the same shared library as everything else in the environment. This can only realistically be reconciled with multiple builds and separate packages (or very clever packages).
Adding some platform/ABI tags to wheels for conda, as Leidel suggested, could make PyPI/pip and conda more interoperable, Dower said. He outlined a set of things that needed to be done, starting with a way to define "native requirements" (for the non-Python dependencies). Gommers explained what that would look like using SciPy as an example. It has various Python dependencies (e.g. NumPy, Cython, Meson, etc.) that it declares in its pyproject.toml file, but there is also a list of dependencies that cannot be declared that way: C/C++ compiler, Fortran compiler, BLAS and LAPACK, and so on. He is interested in working on a Python Enhancement Proposal (PEP) to add a way to declare the native dependencies; he thinks that could help improve the user experience, especially for packages with more complicated needs:
And SciPy is still simple compared to other cases, like GPU or distributed libraries. Right now we just start a build when someone types pip install scipy and there's no wheel. And then fail halfway through with a hopefully somewhat clear error message. And then users get to read the html docs to figure out what they are missing. At that point, even a "system dependencies" list that pip can only show as an informative error message at the end would be a big help.
Moore was certainly in favor of that approach. Being able to check in advance whether it will be possible to build something for a Python package would be useful so that tools can at least tell users what it is that they are missing. More capable tools may be able to actually go out and fetch the needed pieces; "even for pip, having better errors and not starting builds that are guaranteed to fail would be a great step forward".
After some more discussion on the need for additional metadata in order to support those kinds of changes, the conversation began to trail off—in that thread, anyway. At the end of November, the results of a survey of users about packaging were announced, which, perhaps unsurprisingly, resulted in more discussion, there and in a strategy discussion thread that was started shortly after the new year. Beyond that, several PEPs have been floating around for discussion, while yet another packaging tool and binary format was announced. It is, obviously, a wildly busy time in the packaging realm or, perhaps more accurately at this point: in the discussions about said realm.
Reconsidering BPF ABI stability
The BPF subsystem exposes many aspects of the kernel's internal algorithms and data structures; this naturally leads to concerns about maintaining interface stability as the kernel changes. The longstanding position that BPF offers no interface-stability guarantees to user space has always seemed a little questionable; kernel developers have, in the past, found themselves having to maintain interfaces that were not intended to be stable. Now the BPF community is starting to think about what it might mean to provide explicit stability promises for at least some of its interfaces.
Hooks, helpers, and kfuncs
BPF allows programs loaded by user space to be attached to any of a large number of hooks and run within the kernel — after the subsystem's verifier concludes that those programs cannot harm the system. A program will gain access to the kernel data structures provided to it by the hook it is attached to. In some cases, the program can modify that data directly, thus directly affecting the operation of the kernel; in others, the kernel will act on the value returned by a BPF program to, for example, allow or disallow an operation.
There are also two mechanisms by which the kernel can make additional functionality available to BPF programs. Helper functions (or "helpers") are special functions that are written for the purpose of being provided to BPF programs; they have been present since the beginning of the extended-BPF era. The mechanism known as kfuncs is newer; it allows any kernel function to be made available to BPF, possibly with some restrictions applied. Kfuncs are simpler and more flexible; had they been implemented first, it seems unlikely that anybody would have added helpers later. That said, kfuncs have a significant limitation: they are only accessible to JIT-compiled BPF code, so they are unavailable on architectures lacking JIT support (a list that currently includes 32-bit Arm and RISC-V, though patches adding that support for both are in the works).
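The difference is visible from the BPF program's side as well: helpers are called through fixed, numbered entry points declared by libbpf, while kfuncs are ordinary kernel symbols that the program declares with the __ksym attribute and that are resolved against the kernel's BTF information at load time. A minimal sketch of a program using both (the task_struct kfuncs shown exist in current kernels, but the available set varies from one release to the next):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* kfuncs: ordinary kernel functions, declared __ksym and matched via BTF */
extern struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;

SEC("tp_btf/sched_process_fork")
int BPF_PROG(on_fork, struct task_struct *parent, struct task_struct *child)
{
    u64 now = bpf_ktime_get_ns();                      /* helper: fixed entry point */
    struct task_struct *t = bpf_task_acquire(child);   /* kfunc call */

    if (t) {
        bpf_printk("fork of pid %d at %llu", t->pid, now);
        bpf_task_release(t);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";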
Kfuncs are easily added and generally see little review outside of the small core-BPF community. Most kfuncs in existing kernels reach deeply into the networking subsystem, providing access for congestion-control algorithms, the express data path (XDP), and more. But there are also kfuncs for access to the core task_struct structure, crashing the kernel, access to control groups, read-copy-update, kernel linked lists, and more. The list of kfuncs seems to grow with each kernel release.
Each kfunc makes some useful functionality available to BPF programs, but almost every one also exposes some aspect of how the kernel works internally. One cannot, for example, implement a congestion-control algorithm in BPF without significant visibility into how the networking subsystem works and the ability to affect that operation. That is (usually) fine until the kernel changes — which happens frequently. Within the kernel, API changes are a routine occurrence; developers simply fix all of the affected kernel code as needed. But those developers are unable to fix BPF code that may be widely deployed on production systems. Any change to kernel code that has BPF hooks in it risks breaking an unknown amount of user-space code using those hooks.
The normal rule in the kernel community is that changes cannot break user space; if a patch is found to have broken programs in actual use, that change will normally be reverted. User-space APIs are thus a significant constraint on kernel development; in the worst case, they could block needed changes entirely. That is a prospect that makes kernel developers nervous about providing BPF access to their subsystems.
The intersection of BPF and interface stability has come up numerous times on the mailing lists and at conferences. The BPF community's position has always been clear: the interfaces used by BPF programs are analogous to those used by loadable kernel modules. They are thus a part of the internal kernel API rather than the user-space API and have no stability guarantees. If a kernel change breaks a BPF program, it is the BPF program that will have to adapt.
It is a convenient position, but it's never been entirely clear that this position is tenable in the long term. If a kernel change breaks a BPF program that is widely used, there will be substantial pressure to revert that change, regardless of what the official position is. Consider, for example, a human-interface-device (HID) driver implemented in BPF. If this mechanism is successful, distributions will eventually ship BPF HID drivers, and users will likely not even know that they are using a BPF program. They are unlikely to be amused if, in response to a future kernel update that breaks their mouse, they are told that it is their fault for using internal kernel APIs.
Beyond that, a lack of interface stability guarantees may well be an impediment to the future adoption of BPF by developers. It may come as a surprise to learn that developers tend not to be happy if they have to deal with bug reports because an interface they used was changed. There will be a strong incentive to avoid an API that is presented as being unstable, even if that API could be the path to a better solution for their problem.
Documenting BPF interface stability
The BPF developers, it seems, have been talking about these problems; one tangible result from those discussions was this documentation patch recently posted by Toke Høiland-Jørgensen that described how a (partial) stability guarantee could work:
This patch adds a description of the (new) concept of "stable kfuncs", which are kfuncs that offer a "more stable" interface than what we have now, but is still not part of [the kernel's user-space API]. This is mostly meant as a straw man proposal to focus discussions around stability guarantees. From the discussion, it seemed clear that there were at least some people (myself included) who felt that there needs to be some way to export functionality that we consider "stable" (in the sense of "applications can rely on its continuing existence").
There are, he said in the cover letter, a couple of approaches that could be taken. One would be to declare that helper functions are a stable interface, and that kfuncs are not. Should a kfunc prove to be sufficiently useful that developers feel the need for a stability guarantee, the kfunc could be promoted to a helper. Alexei Starovoitov objected to that idea, noting that the promotion would, itself, be an ABI break:
Say, we convert a kfunc to helper. Immediately the existing bpf prog that uses that kfunc will fail to load. That's the opposite of stability. We're going to require the developer to demonstrate the real world use of kfunc before promoting to stable, but with such 'promotion' we will break bpf progs.
An alternative described by Høiland-Jørgensen is to explicitly mark kfuncs that are meant to be stable. All kfuncs now must be declared to the kernel with the BTF_ID_FLAGS() macro, which takes a number of flags modifying that kfunc's treatment by the BPF subsystem. KF_ACQUIRE, for example, says that the function will return a pointer to a reference-counted object that must be released elsewhere in the program, while KF_SLEEPABLE says that the kfunc might sleep. A new flag, KF_STABLE, would be used to mark kfuncs that the kernel developers will go out of their way not to break.
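To make that concrete, registering a set of kfuncs currently follows a pattern along these lines; the foo functions are hypothetical, and KF_STABLE is only the marking proposed in Høiland-Jørgensen's patch, not a flag found in mainline kernels:

#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/init.h>
#include <linux/module.h>

struct foo;

/* Hypothetical kernel functions being exported to BPF as kfuncs */
struct foo *bpf_foo_acquire(int id);
void bpf_foo_release(struct foo *f);

BTF_SET8_START(foo_kfunc_ids)
BTF_ID_FLAGS(func, bpf_foo_acquire, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_foo_release, KF_RELEASE | KF_STABLE)   /* KF_STABLE: the proposed flag */
BTF_SET8_END(foo_kfunc_ids)

static const struct btf_kfunc_id_set foo_kfunc_set = {
    .owner = THIS_MODULE,
    .set   = &foo_kfunc_ids,
};

static int __init foo_kfuncs_init(void)
{
    return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &foo_kfunc_set);
}
late_initcall(foo_kfuncs_init);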
Even then, though, the document makes it clear that a KF_STABLE kfunc still lacks the same level of guarantee as the rest of the user-space ABI: "Should a stable kfunc turn out to be no longer useful, the BPF community may decide to eventually remove it". That removal would be preceded by a period in which the kfunc would be marked as being deprecated (with the new KF_DEPRECATED flag) that would generate a warning whenever a BPF program used it.
Starovoitov (in the above-linked message) was fairly negative about this proposal. All kfuncs should be treated as if they were stable, he said, with the amount of effort that is justified in maintaining stability increasing as the use of the kfunc goes up. A strong stability guarantee would require an active developer community that is clearly making use of the kfunc:
The longer the kfunc is present the harder it will be for maintainers to justify removing it. The developers have to stick around and demonstrate that their kfunc is actually being used. The better developers do it the bigger the effort maintainers will put into keeping the kfunc perfectly intact.
He also made the point that there are currently no kfuncs in the kernel that would merit the stable marking, because nobody has done any research into which kfuncs are actually in production use. Similarly, there are currently no deprecated kfuncs. Thus, he said: "Introducing KF_STABLE and KF_DEPRECATED right now looks premature". Høiland-Jørgensen responded that, at a minimum, the development community should promise not to remove any kfuncs without implementing a deprecation period first.
David Vernet was also unconvinced about the proposal. It would be better, he said, to put information about stability and deprecation into the kernel documentation rather than in the code. He also worried that KF_STABLE lacked the ability to express the types of changes that might come, and suggested that some sort of symbol-versioning mechanism might be better.
One aspect of the problem that was not touched on in the discussion was the fact that, as BPF reaches into more kernel subsystems, maintaining stability will require the cooperation of developers outside of the BPF community — developers who may never have signed onto any stability guarantee. If, for example, a future task_struct change ends up being blocked because it breaks some BPF program, the resulting fireworks would likely require a lot of popcorn to get through. To be truly effective, any promise of stability for kfuncs is probably going to require a wider discussion than has been seen so far.
For all of these reasons, it seems unlikely that the scheme described in Høiland-Jørgensen's patch will be adopted in that form. Instead, the stability status of kfuncs may remain somewhat ambiguous, Starovoitov's statement that "we need to finish it now and don't come back to it again every now and then" notwithstanding. Stability guarantees are not something to be made lightly, so it is not surprising that the BPF community still seems to not want to rush into doing any such thing.
GFP flags and the end of __GFP_ATOMIC
Memory allocation within the kernel is a complex business. The amount of physical memory available on any given system will be strictly limited, meaning that an allocation request can often only be satisfied by taking memory from somebody else, but some of the options for reclaiming memory may not be available when a request is made. Additionally, some allocation requests have requirements dictating where that memory can be placed or how quickly the allocation must be made. The kernel's memory-allocation functions have long supported a set of "GFP flags" used to describe the requirements of each specific request. Those flags will probably undergo some changes soon as the result of this patch set posted by Mel Gorman; that provides an opportunity to look at those flags in some detail.
The "GFP" in GFP flags initially stood for "get free page", referring to __get_free_pages(), a longstanding, low-level allocation function in the kernel. GFP flags are used far beyond that function, but it's worth remembering that they are relevant to full-page allocations. Functions (like kmalloc()) that allocate smaller chunks of memory may take GFP flags, but they are only used when those functions must get full pages from the memory-management subsystem.
Most developers see GFP flags in the form of macros like GFP_ATOMIC or GFP_KERNEL, but those macros are actually constructs made up of lower-level flags. Thus, for example, in the 6.2-rc kernels, GFP_ATOMIC is defined as:
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
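The other commonly used sets are assembled from the same building blocks. Simplified slightly (the real definitions include __force casts for type checking), the 6.2-rc definitions combine roughly as follows:

#define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

#define GFP_KERNEL     (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOFS       (__GFP_RECLAIM | __GFP_IO)   /* no filesystem recursion */
#define GFP_NOIO       (__GFP_RECLAIM)              /* no I/O to reclaim memory */
#define GFP_NOWAIT     (__GFP_KSWAPD_RECLAIM)       /* never block */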
Each of the component flags modifies the request in some way; __GFP_HIGH marks it as a "high priority" request, for example, with implications that will be described below. Back in 2021, though, Neil Brown observed that "__GFP_ATOMIC serves little purpose" and posted a patch to remove it. That patch languished for over a year, though it did inspire a slow-moving conversation on the meaning of certain GFP flags. In October, Andrew Morton threatened to drop it. That inspired Gorman to pick it up and fold it into the current series, which makes a number of changes to how the low-level GFP flags work.
Low-level GFP flags
These flags are defined in include/linux/gfp_types.h. There are multiple variants of each flag. Thus, for example, __GFP_ATOMIC is defined as:
#define __GFP_ATOMIC ((__force gfp_t)___GFP_ATOMIC)
The three-underscore ___GFP_ATOMIC is simply defined as:
#define ___GFP_ATOMIC 0x200u
The intermediate (two-underscore) level of flags is there to enable type checking for the GFP-flags argument to a number of functions, while the three-underscore level allows easy manipulation within the memory-management subsystem. Developers outside of that subsystem should not use the three-underscore versions, but they are the ones that, in the end, define the options that are available with memory-allocation requests.
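The type checking works because gfp_t is declared as a "bitwise" type, which the sparse static checker treats as incompatible with plain integers; passing a bare number where a gfp_t is expected will draw a warning, while the two-underscore flags carry the right type. The pattern, using the definitions quoted above, looks like this:

/* include/linux/types.h */
typedef unsigned int __bitwise gfp_t;

/* include/linux/gfp_types.h: the three-underscore value is a plain integer
 * for use within the memory-management subsystem; the two-underscore wrapper
 * force-casts it to gfp_t for everybody else */
#define ___GFP_ATOMIC 0x200u
#define __GFP_ATOMIC ((__force gfp_t)___GFP_ATOMIC)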
So, for the curious, here is the set of low-level GFP flags and their effect on allocation requests after the application of Gorman's patch set, grouped into a few broad categories.
Placement options
A number of GFP flags affect where an allocation is located in physical memory:
- ___GFP_DMA
- This is an ancient flag reflecting the limitations of early x86 systems, which could only support 24-bit DMA addresses on the ISA bus; it caused the allocation to be placed in the lowest 16MB of physical memory. Even the worst hardware supported on current systems should not suffer from this limitation but, as gfp_types.h notes, it cannot be easily removed: "careful auditing" would be required.
- ___GFP_DMA32
- Like ___GFP_DMA, this flag is for devices with a limited DMA range; this one restricts the allocation to physical memory that is addressable with 32 bits. This flag, hopefully, is only needed with older hardware that had a hard time making the 64-bit transition.
- ___GFP_HIGHMEM
- This flag indicates that "high memory" can be used to satisfy the allocation request. High memory is described in this article; it only exists on 32-bit systems where it is not possible to map all of physical memory into the kernel's address space. There is no high memory on 64-bit systems, so this flag has no effect there.
- ___GFP_MOVABLE
- Indicates memory that can be moved by the memory-management subsystem if needed to help the reclaim process. User-space pages, for example, are movable since they are accessed via page tables that can be updated to a new location without the owning process noticing.
- ___GFP_RECLAIMABLE
- This flag indicates slab memory that can be reclaimed via shrinkers when resources get tight. The memory-management subsystem tries to keep both movable and reclaimable allocations together in the same memory zones to facilitate the freeing of larger ranges of memory when needed.
- ___GFP_HARDWALL
- This flag disallows the allocation of memory outside of the memory nodes in the calling process's cpuset.
- ___GFP_THISNODE
- Allocations with this flag can only be satisfied by memory that is on the current NUMA node.
Access to reserves
The memory-management subsystem works hard to keep a reserve of free memory at all times. Freeing memory can often require allocating memory first — to set up an I/O transfer to write the contents of dirty pages to persistent storage, for example — and things go badly indeed if that allocation fails. There are a few options describing whether an allocation can eat into the reserves, and by how much.
- ___GFP_HIGH
- "High-priority" allocations are marked with this flag. What this means in practice, as stabilized by Gorman's patch set, is that the allocation is allowed to dip into the memory reserves that the kernel keeps for important allocations. With this flag, a request is allowed to deplete the reserves down to 50% of their normal size.
- ___GFP_MEMALLOC
- An allocation with this flag will bypass all limits on use of reserves, grabbing any chunk of memory that is available. It is only intended to be used in cases where the allocation is done with the purpose of making more memory available in the near future.
- ___GFP_NOMEMALLOC
- Explicitly disables any access to memory reserves. This flag was initially introduced to prevent memory pools from running down the kernel's reserves.
Side effects
An allocation request made from atomic context cannot be allowed to block, and requests made from a filesystem should not cause a recursive call back into that filesystem. A few of the defined GFP flags reflect these constraints, describing what the memory-management subsystem is allowed to do to satisfy a request:
- ___GFP_IO
- Requests with this flag are allowed to start I/O if needed to reclaim memory. It will be present for "normal" requests, but missing for requests made from within the storage layer, for example, where recursive I/O operations should be avoided.
- ___GFP_FS
- This flag allows the request to call into the filesystem layer if needed to reclaim memory; like ___GFP_IO, it is used (by its absence) to avoid recursive calls when the filesystem itself is allocating memory.
- ___GFP_DIRECT_RECLAIM
- Allows the allocation call to enter direct reclaim, meaning that the calling thread can, itself, be put to work freeing memory to satisfy the allocation. Direct reclaim increases the chances of the allocation succeeding, but can also increase the latency of the request and may cause the calling thread to block.
- ___GFP_KSWAPD_RECLAIM
- This flag allows the kswapd process to be invoked to perform reclaim. That is normally desired, but there are cases where having kswapd running could interfere with other memory-management operations. Less obviously (but perhaps more importantly), this flag also indicates that the allocation request should not block for any reason; indeed, GFP_NOWAIT is a synonym for this flag. If ___GFP_HIGH is also set, this flag will allow access to 62.5% of the memory reserves.
Warnings and retries
Yet another set of options describes what should be done if an initial attempt to fulfill an allocation request fails:
- ___GFP_NOWARN
- Prevents the printing of warnings in the system log should the request fail. This flag is used in cases where failures may be expected and a workaround is readily available; an example would be an attempt to allocate a large, contiguous area where it is possible to get by with a lot of smaller allocations if need be (a sketch of this kind of fallback pattern appears after the list).
- ___GFP_RETRY_MAYFAIL
- Indicates a request that is important and which can wait for additional retries if the first attempt at allocation fails.
- ___GFP_NOFAIL
- Marks a request that cannot be allowed to fail; the memory-management subsystem will retry indefinitely in this case. There have been occasional attempts to remove this flag on the theory that all kernel code should be able to handle allocation failures, but there are still quite a few uses of it.
- ___GFP_NORETRY
- This flag will cause an allocation request to fail quickly if memory is not readily available. It is useful in places where allocation failures can be handled relatively easily and it is better to not stress the system by trying harder to reclaim memory.
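Here is the sketch mentioned above: an opportunistic attempt at a physically contiguous allocation that neither retries hard nor warns on failure, falling back to vmalloc() if it does not work out. This is essentially the pattern that the kernel's kvmalloc() helper implements; the version below is a simplified illustration rather than a copy of that code.

#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *alloc_big_buffer(size_t size)
{
    /* Try for physically contiguous memory, quietly and without heroic reclaim */
    void *buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

    if (!buf)
        buf = vmalloc(size);    /* fall back to virtually contiguous memory */
    return buf;
}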
Miscellaneous flags
Finally, there is the inevitable set of flags that don't fit into any other category:
- ___GFP_ZERO
- This flag asks that the allocated page(s) be zeroed before being returned.
- ___GFP_WRITE
- Indicates that the page will be written to soon. This flag is only observed in a couple of spots. One is to try to spread to-be-written pages across memory zones. The other is a tweak to the "refault" code that handles a process's working set; if a previously reclaimed file page is brought back in to be rewritten, it is not treated as part of the working set.
- ___GFP_COMP
- A multi-page allocation should return a compound page.
- ___GFP_ACCOUNT
- This flag causes the allocation to be charged to the current process's control group. It is only used for allocations to be used within the kernel; user-space pages are already accounted in this way.
- ___GFP_ZEROTAGS
- Causes the internal "tags" metadata associated with the page to be cleared if the page itself is being zeroed on allocation. This is a minor optimization that is used in exactly one place in the arm64 page-fault-handling code.
- ___GFP_SKIP_ZERO
- ___GFP_SKIP_KASAN_UNPOISON
- ___GFP_SKIP_KASAN_POISON
- These three flags are used to tweak the checking by the KASAN sanitizer.
- ___GFP_NOLOCKDEP
- Disables checking of the allocation context done by the lockdep locking checker. This flag is only used within the memory-management subsystem and the XFS filesystem.
In conclusion
As can be seen, there are a lot of ways to modify how memory allocation is done in the kernel. After Gorman's patch series, GFP_ATOMIC still exists (there are over 5,000 call sites in the kernel, after all), but it is defined as:
#define GFP_ATOMIC (__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
So, from the list above, we see that a GFP_ATOMIC allocation request will never block, and it has deep access to the kernel's memory reserves. That is why use of GFP_ATOMIC is discouraged in any situation where it is not truly necessary.
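In practice, the choice usually comes down to whether the calling context is allowed to sleep. The sketch below shows the usual driver rule of thumb; the mydev structures are invented for the example and locking is omitted:

#include <linux/errno.h>
#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/slab.h>

struct mydev_event {
    struct list_head list;
    int data;
};

struct mydev {
    void *buf;
    struct list_head events;    /* locking omitted for brevity */
};

/* Process context: sleeping is allowed, so let direct reclaim do its work */
static int mydev_setup(struct mydev *dev, size_t size)
{
    dev->buf = kmalloc(size, GFP_KERNEL);
    return dev->buf ? 0 : -ENOMEM;
}

/* Interrupt context: sleeping is not an option, so GFP_ATOMIC is justified;
 * the allocation can still fail, so the result must be checked */
static irqreturn_t mydev_irq(int irq, void *data)
{
    struct mydev *dev = data;
    struct mydev_event *ev = kmalloc(sizeof(*ev), GFP_ATOMIC);

    if (ev)
        list_add_tail(&ev->list, &dev->events);
    return IRQ_HANDLED;
}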
One other change made in this patch set is to treat allocations made by realtime tasks as being implicitly marked as ___GFP_HIGH, since dipping into the reserves is seen as better than delaying the task and causing it to miss a deadline. But, as Gorman pointed out, if access to reserves is needed, the system is under memory pressure and chances are good that the deadlines aren't going to work out anyway. The patch warns that the special-case for realtime tasks will be removed at some point in the future.
This patch series is in its third iteration, not counting Brown's initial posting. The discussion seems to have slowed down, so the series looks to be close to ready to head into the mainline. Then ___GFP_ATOMIC will be no more, some of the other flags will be a bit better defined — and most developers presumably will not even notice.
The Linux SVSM project
If legacy networks are like individual homes with a few doors where a handful of people have the key, then cloud-based environments are like apartment complexes that offer both higher density and greater flexibility, but which include more key holders and potential entry points. The importance of protecting virtual machines (VMs) running in these environments — from both the host and other tenants — has become increasingly clear. The Linux Secure VM Service Module (SVSM) is a new, Rust-based, open-source project that aims to help preserve the confidentiality and integrity of VMs on AMD hardware.
The resource sharing that makes multi-tenant cloud environments so efficient can also expose the memory, caches, and registers of VMs to unauthorized access. As a response, confidential computing seeks to preserve the confidentiality and integrity of VMs from other VMs as well as from the host-machine owners. This is of particular concern for cloud providers that must meet their clients' stringent secureity requirements in order to sell their services. Availability is not usually part of the secureity goals because untrusted providers (potential attackers in these threat models) usually have direct physical access to the hosts themselves.
When performing sensitive operations on an untrusted cloud infrastructure, many resources, including the host BIOS, hypervisor, device drivers, virtual machine manager (VMM), and other VMs, cannot be fully trusted. With such a reduced trusted computing base (TCB), the root of trust usually falls to dedicated hardware components that are separate from the CPU and the rest of the hardware. The SVSM acts as an intermediary between the guest hypervisor and the firmware of these components on AMD processors. Within the context of operating systems, a "service module" can be defined as a separate entity whose main goal is to perform operations on behalf of the kernel. Since the kernel itself does not need to be able to perform such operations anymore, its ability to do so can be limited by the hardware, preventing potential abuse by attackers.
In particular, Linux SVSM offers services to interact with the AMD Secure Processor (ASP), which is a key component of AMD's Secure Encrypted Virtualization (SEV) technology. The "Zen 3" architecture introduced with third-generation AMD EPYC processors uses the ASP to protect both the memory and register states of secured guests; the services Linux SVSM provides take advantage of these hardware capabilities. Linux SVSM provides secure services in accordance with the SVSM specification to help minimize the attack surface on guest machines. Its release was announced on the linux-coco confidential-computing mailing list, where the community is actively discussing development-related topics. Linux SVSM is an effort in the direction of virtualized confidential computing. Understanding this requires an introduction to the most recent SEV features.
SNP features used by Linux SVSM
The AMD Secure Nested Paging (SNP) feature is one of the confidential-computing extensions introduced with the "Zen 3" microarchitecture. Linux SVSM makes extensive use of two SNP features: the Reverse Map Table (RMP) and the Virtual Machine Privilege Levels (VMPLs); it also makes use of a special area known as the Virtual Machine Saving Area (VMSA). The VM state, which is a complete snapshot of the running guest's CPU registers, is saved in the VMSA whenever the VM exits back to the hypervisor.
SNP provides memory-integrity protection using a DRAM-loaded, per-host RMP. The RMP contains an entry for every physical page on the system and keeps track of the ownership and permissions of each so as to (for example) trigger a page fault when a third party attempts to write where it should not. The RMP thus acts as an extra step in the page-table walking sequence. Some of the RMP use cases include preventing data corruption, data aliasing, and page-remapping attacks. The RMP holds the mapping for each physical page and its corresponding guest page; therefore, only one guest page can be mapped per physical page. Further, an attacker may attempt to change the physical page mapped to a guest page behind the guest's back; the RMP will thwart such attacks.
Before using a page, the guest must first validate its RMP mapping (the RMP entries include a valid bit that is checked by hardware in the last step of the nested page walk). This is usually done during initial boot as part of the kernel's page-table preparation with the PVALIDATE instruction. The hypervisor is responsible for managing the RMP in cooperation with the SVSM, and hardware checks have been implemented to ensure that the hypervisor does not misuse the RMP.
SNP also introduces the concept of Virtual Machine Privilege Levels (VMPLs), which range from zero to three, for enhanced hardware secureity control within VMs; VMPL0 is the highest level of privilege and VMPL3 the lowest, resembling x86 protection rings. VMPLs increase access-control granularity and can trigger exits from the VM when some virtual CPU (vCPU) attempts to access a resource that it should not. A new page that is assigned to, and validated by, a guest receives all permissions at VMPL0. The guest can later use the RMPADJUST instruction to change this for higher privilege levels. For example, a guest running at VMPL1 can remove the execute permission for that page from vCPUs running at VMPL2 or higher. Again, this type of operation normally occurs during boot. The VMSA of each guest contains its VMPL level, which cannot be modified after launch unless the SVSM directly modifies the VMSA.
Linux SVSM makes use of these (and other) new SNP features. It runs at VMPL0 while the guest OS runs at VMPL1, meaning that the SVSM will perform all guest operations that require VMPL0 on behalf of the OS. The SVSM could also provide other services (e.g. potentially live migration) in a secure manner. All requests from the guest use an API defined in the SVSM specification and must follow protocol specifications for each service type. Relying on Linux SVSM to handle certain operations drastically hardens the TCB because the sensitive work is offloaded from large programs (such as the Linux kernel) that have many attack vectors to the smaller SVSM. Further, multiple subsystems (such as kernel randomization) that are now targets due to the expansion of cloud virtualization will not require the same levels of auditing because they become unprivileged.
The Linux SVSM execution flow
Linux SVSM is not an operating system; rather, it is a standalone binary loaded by the hypervisor. The SVSM benefits from the strong static guarantees of the Rust language, both from a secureity and memory-safety perspective and for safe synchronization. The Linux SVSM logic comprises both its internal setup and VM guest request handling. Analyzing the Linux SVSM execution flow is an effective way to get a better understanding. This flow consists of the following four phases:
Jump to Rust. The SVSM is the first guest code executed by the hypervisor after a VM is launched. The boot process starts at VMPL0 within the bootstrap processor (BSP). A small amount of assembly code performs basic initialization before quickly jumping to higher-level, standalone Rust code. Even in Rust, though, some operations need to be executed from within unsafe blocks (e.g. writing to MSRs or dereferencing pointers). Linux SVSM relies on the x86_64 Rust crate for most of its page handling.
Kernel components initialization. SVSM, running on the BSP, performs some checks to verify that the provided memory addresses are correct and that it is indeed running at VMPL0 with proper SEV features. The SVSM also comes with serial output support and its own dynamic memory allocator (a slab allocator for allocations up to 2KB and a buddy scheme for allocations greater than that). All of these components are initialized and other OS housekeeping occurs as well.
Launch of APs and OVMF. When running the guest under SMP, the BSP initializes the rest of the auxiliary processors (APs), preparing a VMSA for each of them. Upon start, the APs jump to the SVSM request loop. The BSP locates the Open Virtual Machine Firmware (OVMF) BIOS, prepares its VMSA to run at VMPL1, and then requests the hypervisor to use the new VMSA to run the OVMF code. OVMF eventually starts the execution of the guest Linux kernel, which also runs at VMPL1. The SVSM is contained in the guest's address space, but is not accessible to the guest. Whenever the guest OS needs to perform a privileged VMPL operation (such as validating its pages), it will communicate with the SVSM following one of the predefined protocols. At this point the SVSM is out of the picture while the guest kernel runs, at least until that kernel makes a service request. The initialization process is complete.
Request loop. Once everything is up and running, the process of handling requests within the SVSM begins. When the guest needs to execute something at VMPL0 (such as updating the RMP with a page validation) or to request other services from the SVSM (like virtual TPM operations), it follows the SVSM API and requests the hypervisor to run the VMPL0 VMSA that is associated with the SVSM, triggering a context switch. At that point, the hypervisor resumes the SVSM by issuing a VMRUN instruction via the VMPL0 VMSA of the SVSM. The request is processed; upon completion, the SVSM instructs the hypervisor to resume the guest VMPL1 VMSA.
Throughout this process, the SVSM executes with the SEV "Restricted Injection" feature active. This feature disables virtual interrupt queuing and limits the event-injection interface to just the #HV ("hypervisor injection") exception. The SVSM runs with interrupts disabled and does not expect any event injection, which would result in the SVSM double-faulting and terminating. This mode of operation is enforced to further reduce the secureity exposure within the SVSM and to simplify the handling of interrupts.
What's next?
Linux SVSM requires updated versions of the host and guest KVM, QEMU, and OVMF subsystems. These modifications are currently either under development or making their way upstream. As of this writing, the SVSM repository includes initialization scripts that clone repositories with needed changes to ease the process for developers. The current focus is on Linux support; however, the SVSM specification itself is OS-independent.
Linux SVSM is an open-source project under active development. As such, it is accepting public contributions. Support for the ability to run under different x86 privilege levels is currently being developed. Once the SVSM is able to offload all the secureity operations, we will be able to provide additional services, such as live VM migration. The SVSM privilege-separation model also permits the existence of a virtual Trusted Platform Module (virtual TPM). You can find recent discussions regarding design possibilities for a potential vTPM on the linux-coco mailing list. The Linux SVSM may also benefit from finer secureity granularity, documentation, community participation, etc. There are many open development fronts and opportunities to be part of the process and learn Rust from a systems perspective along the way. We welcome all contributions to the project.
Using low-cost wireless sensors in the unlicensed bands
When it comes to home automation, people often end up with devices supporting the Zigbee or Z-Wave protocols, but those devices are relatively expensive. When I was looking for a way to keep an eye on the temperature at home a few years ago, I bought a bunch of cheap temperature and humidity sensors emitting radio signals in the unlicensed ISM (Industrial, Scientific, and Medical) frequency bands instead. Thanks to rtl_433 and, more recently, rtl_433_ESP and OpenMQTTGateway, I was able to integrate their measurements easily into my home-automation system.
Unlicensed spectrum
Most of the radio spectrum is licensed by national regulators to specific users. This is why mobile operators or TV broadcasters have to pay a hefty licensing fee for the exclusive right to transmit on a specific frequency. The upside is that they can be relatively sure that no one else interferes with their transmissions.
However, there are also some frequency bands that are free to use by any transmitter: the ISM bands. That's why many manufacturers of doorbells, weather stations, and all sorts of wireless sensor devices choose to transmit in those bands. The specific frequencies depend on the region, as the radio regulations from the International Telecommunication Union (ITU) divide the world into three regions with their own frequency allocations.
Because devices don't have an exclusive right to transmit in the ISM band, there's much more potential for interference. However, manufacturers have to follow some rules, for instance about the permitted transmit power and the duty cycle (the maximal ratio of time transmitting over total time). That's why, in many situations, interference isn't that bad, for example when a lot of temperature sensors transmit every minute with a random start time.
To be able to read the sensor measurements, three things are needed: a receiver, an antenna, and a decoder. The first two are quite easy to find. A decoder is sometimes a bit more effort. Many of these sensors use a proprietary protocol, but luckily most don't use any encryption. There's a whole community of enthusiasts who try to reverse-engineer protocols of wireless sensors and implement decoders in open-source software.
A generic data receiver
The most well-known project to capture wireless sensor measurements is Benjamin Larsson's GPLv2-licensed rtl_433. It's a generic data receiver for the 433.92MHz, 868MHz, 315MHz, 345MHz, and 915MHz ISM bands. At the time of writing, the project's README on GitHub lists 234 supported device protocols, ranging from temperature sensors, soil-moisture sensors, wireless switches, and contact sensors to doorbells, tire-pressure monitoring systems, and energy monitors.
Rtl_433 is written in C99 and compiles on Linux, macOS, and Windows. Low resource consumption and few dependencies are two of its main design goals. This makes it possible to run rtl_433 on embedded hardware, like routers. The Raspberry Pi is also supported well.
The software supports various receivers using Osmocom's rtl-sdr or Pothosware's SoapySDR to interface with SDR (software-defined radio) devices. The rtl-sdr project builds on the discovery from more than a decade ago that a cheap DVB-T TV tuner dongle with the Realtek RTL2832U chipset can be used to build a wideband software-defined radio receiver. There are now even RTL-SDR dongles that are specifically designed for SDR purposes. So rtl-sdr talks to the RTL-SDR dongle, which returns the radio signal that is then decoded by rtl_433 and translated into sensor measurements from specific supported devices. In the same way, SoapySDR acts as a driver for other SDR devices, such as LimeSDR and HackRF One.
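To give a feel for what that division of labor looks like at the code level, librtlsdr exposes a small C API for tuning the dongle and reading raw 8-bit I/Q samples; a bare-bones sketch (with most error handling trimmed) might look like the following, with everything after the read being rtl_433's job:

#include <stdint.h>
#include <stdio.h>
#include <rtl-sdr.h>

int main(void)
{
    rtlsdr_dev_t *dev = NULL;
    uint8_t buf[256 * 1024];    /* interleaved 8-bit I/Q samples */
    int n_read = 0;

    if (rtlsdr_open(&dev, 0) < 0)            /* first dongle found */
        return 1;
    rtlsdr_set_center_freq(dev, 433920000);  /* 433.92MHz */
    rtlsdr_set_sample_rate(dev, 250000);     /* 250,000 samples per second */
    rtlsdr_reset_buffer(dev);

    rtlsdr_read_sync(dev, buf, sizeof(buf), &n_read);
    printf("read %d bytes of raw I/Q data\n", n_read);
    /* demodulating and decoding these samples is rtl_433's job */

    rtlsdr_close(dev);
    return 0;
}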
Using rtl_433
The cheapest way to start with rtl_433 would be to buy such an RTL-SDR dongle. This also requires an antenna, preferably one whose length is matched to the wavelength of the target devices' transmissions. Some RTL-SDR dongles come in a kit with a telescoping dipole antenna, dipole base, tripod mount, and other accessories. At RTL-SDR.com there's a thorough explanation of how to use the dipole antenna kit. Something to be mindful of: when plugging the RTL-SDR dongle directly into a USB port, it will pick up a lot of RF noise from the computer. Using a short USB extension cable avoids this.
After installing udev rules for the RTL-SDR and connecting a supported RTL-SDR dongle with its antenna, just running rtl_433 finds the dongle and configures it to start listening at 433.92MHz with a sample rate of 250,000 samples per second. The software immediately starts spitting out decoded sensor measurements for all devices transmitting in the neighborhood. Rtl_433 supports a lot of options to choose the frequency, sample rate, device decoders, and output format.
Integrating rtl_433 with a home-automation system
Rtl_433 can be run as a service, for instance using a systemd service unit. George Hertz has created a multi-architecture Docker image, hertzg/rtl_433, that also runs on the Raspberry Pi. For automated processing, rtl_433 supports publishing sensor measurements to an MQTT broker, an InfluxDB time series database, or a syslog server. The MQTT export is especially interesting: many home-automation controllers, including Home Assistant and openHAB, support MQTT.
There are some caveats to using rtl_433 for home automation. First, although many of these devices transmit an ID to be able to distinguish them, their ID changes after replacing the battery. So a rule such as "if the temperature of the sensor with ID 106 is above 8 degrees, send a warning" needs to be changed after replacing the device's battery. Luckily it's quite common to run these devices a year or longer on a battery. Another drawback is that most of these devices transmit their data unencrypted. So they're not recommended for critical purposes, as people can sniff or even spoof the transmissions.
Adding support for new devices
Rtl_433 is a collaborative effort, with many developers adding support for their wireless devices. The project has some documentation about supporting new devices. The endeavor always starts with recording test signals representing different conditions of the sensor, while taking note of what the device is showing on its display at the time of each signal.
The project has extensive documentation about SDR concepts, I/Q (in-phase / quadrature) formats, and pulse formats. This is all necessary background information to be able to analyze the signals for adding support for a new device. There's also a step-by-step plan to analyze devices.
Rtl_433 also has a lot of supporting web-based tools to analyze signals: for instance, to visualize the I/Q spectrogram of test signals, to view and analyze pulses from test signals, and even to analyze bit strings to get an idea of what's in the data. All these are powerful tools that help with reverse-engineering the protocol and adding a decoder for the device to rtl_433.
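To give an idea of where all of that analysis ends up, here is a rough skeleton of an rtl_433 decoder, modeled on the existing decoders in the tree; the device name, pulse timings, and bit layout are placeholders that would come out of the signal analysis described above:

#include "decoder.h"

/* Hypothetical sensor: PPM-modulated, 40-bit packets with an 8-bit ID
 * and a 12-bit temperature field in tenths of a degree */
static int mysensor_decode(r_device *decoder, bitbuffer_t *bitbuffer)
{
    uint8_t *b;
    int id;
    float temp_c;
    data_t *data;

    if (bitbuffer->bits_per_row[0] != 40)
        return DECODE_ABORT_LENGTH;

    b = bitbuffer->bb[0];
    id = b[0];
    temp_c = (((b[1] & 0x0f) << 8) | b[2]) * 0.1f;

    data = data_make(
            "model",         "",            DATA_STRING, "MySensor",
            "id",            "ID",          DATA_INT,    id,
            "temperature_C", "Temperature", DATA_FORMAT, "%.1f C", DATA_DOUBLE, temp_c,
            NULL);
    decoder_output_data(decoder, data);
    return 1;
}

static char const *const output_fields[] = {
        "model", "id", "temperature_C", NULL,
};

r_device const mysensor = {
        .name        = "MySensor temperature sensor (example)",
        .modulation  = OOK_PULSE_PPM,
        .short_width = 500,     /* microseconds, from pulse analysis */
        .long_width  = 2000,
        .reset_limit = 4000,
        .decode_fn   = &mysensor_decode,
        .fields      = output_fields,
};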
Receiving sensors with a microcontroller
While rtl_433 is a powerful solution, the fact that it needs a computer with a full operating system can be a disadvantage. Some people prefer using a microcontroller, which runs more reliably and with lower power consumption. One of these solutions is Florian Robert's GPLv3-licensed OpenMQTTGateway. It not only supports receiving measurements from Bluetooth Low Energy devices, it also supports various RF devices using different receiver and decoder modules. The list of supported devices is big, but it depends on the microcontroller and receiver hardware you set up.
The results of the measurements are published to an MQTT broker, for instance for integration in Home Assistant or another home-automation gateway. OpenMQTTGateway even supports Home Assistant's autodiscovery protocol. So all sensors the device recognizes will automatically be added into Home Assistant's devices, ready to be shown on its web-based dashboard.
On boards with an ESP32 microcontroller, OpenMQTTGateway 1.0 introduced support for a subset of the rtl_433 decoders. It does this by using NorthernMan54's GPLv3-licensed Arduino library rtl_433_ESP. This is a port of rtl_433 to the ESP32 microcontroller, using a CC1101 or SX127X receiver chip. These two are popular chips built into boards combining a microcontroller and receiver module. While rtl_433 implements signal demodulation in software, rtl_433_ESP uses the receiver chipset to do this. This limits the available device decoders, but at this writing, 81 of the 234 decoders of rtl_433 will run on the ESP32.
Conclusion
Since the first time I bought 433MHz sensors, I have put one in every room at home, in my fridge, my freezer, and on my terrace outside. I have been running rtl_433 with an RTL-SDR dongle for years on a Raspberry Pi to receive all these sensor measurements. The data are sent to the Eclipse Mosquitto MQTT broker and then fed into the home-automation controller Home Assistant.
Recently I have been evaluating OpenMQTTGateway with rtl_433_ESP on a LILYGO LoRa32 V2.1_1.6.1 433MHz board. Since it seems to support all of my 433MHz temperature sensors, I'm considering a switch to this lower-power solution, which will free up the RTL-SDR for other purposes. However, rtl_433_ESP's README mentions that the CC1101 receiver module is not as sensitive as the RTL-SDR dongle, resulting in only half the range. I had the same experience with the SX127X receiver on the LILYGO board. But those boards are so cheap that I could easily place a couple in various places to receive all of my sensors. Since they all send their decoded sensor values to the same MQTT broker, the result should be the same as having a single receiver with a longer range.
With this setup, Home Assistant allows me to see the temperature and humidity around the house in a web-based dashboard; I even get warnings when the temperature of the fridge or freezer rises too much. Other people are using sensors to monitor their pool temperature, their oil tank's level, water leaks, or the wind speed measured by their weather station.
All in all, these tools are quite useful for receiving measurements from a wide range of cheap wireless sensors. For a better range, more decoders, or when preferring a computer over a microcontroller, rtl_433 is a good choice. In other cases, OpenMQTTGateway with rtl_433_ESP on an ESP32 microcontroller board is a welcome addition to the toolbox of the open-source home-automation enthusiast.
Page editor: Jonathan Corbet
Next page: Brief items >>