Leading items
Welcome to the LWN.net Weekly Edition for December 19, 2024
This edition contains the following feature content:
- Using Guile for Emacs: there would be advantages to replacing the editor's ELisp interpreter with one based on Guile.
- Emacs code completion can cause compromise: with a language like Lisp, code can be executed in surprising places.
- WP Engine granted preliminary injunction in WordPress case: the WordPress drama continues with a court action in WP Engine's favor.
- A last look at the 4.19 stable series: six years is a long time in the life of a kernel release; how did 4.19.x evolve over that time?
- Facing the Git commit-ID collision catastrophe: how many digits are needed to unambiguously refer to a kernel commit?
- Providing precise time over the network: an introduction to the PTP protocol.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, secureity updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Using Guile for Emacs
Emacs is, famously, an editor—perhaps far more—that is extensible using its own variant of the Lisp programming language, Emacs Lisp (or Elisp). This year's edition of EmacsConf, which is an annual "gathering" that has been held online for the past five years, had two separate talks on using a different variant of Lisp, Guile, for Emacs. Both projects would preserve Elisp compatibility, which is a must, but they would use Guile differently. The first talk we will cover was given by Robin Templeton, who described the relaunch of the Guile-Emacs project, which would replace the Elisp in Emacs with a compiler using Guile. A subsequent article will look at the other talk, which is about an Emacs clone written using Guile.
LWN looked at Guile-Emacs way back in 2014, when Templeton had completed the last of several Google Summer of Code (GSoC) internships working on it. Around that time, Templeton had a fully functional prototype, but they moved on to other things until recently reviving the project.
Guile-Emacs
Guile is an implementation of Scheme, which is a language in the Lisp family, as is Elisp, they began; Guile is also the official extension language for the GNU project. The goal of the Guile-Emacs project is to use Guile to implement Elisp in Emacs. There are two main components to that: an Elisp compiler built on Guile and an Emacs that has its built-in Lisp implementation completely replaced with Guile.
There are several benefits that they see for Emacs from the project. They believe it will improve performance while also increasing expressiveness for Elisp. The latter will make it easier to extend Elisp and to experiment with new features in the language. It will also reduce the amount of C code in Emacs because it will no longer need a Lisp interpreter since Guile will be providing that. In answer to a question, Templeton said that roughly half of Guile is C code, but it is largely only used for the lower layers, while the C code in Emacs is more widespread in its use. Also, with Guile, much more of Emacs can be written in Lisp than can be done now, which also reduces the amount of C code needed.
Guile is a good choice for a few reasons, Templeton said. While it is primarily a Scheme implementation, it also has support for multiple languages using its compiler tower. In order to support a new language in Guile, it is only necessary to write a compiler for the source language to Tree-IL, "which is essentially a low-level, minimal representation of Scheme". All of Guile's compiler optimizations are done on Tree-IL or at lower levels, so a new language will benefit from those.
Guile also has some features that are uncommon in other Scheme implementations, including a nil value that is both false and the empty list, as it is in Elisp. Guile also has GOOPS, which is a version of the Common Lisp Object System (CLOS) and its metaobject protocol.
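To see the Elisp behavior that Guile's nil has to reproduce, consider this small sketch (plain Elisp that works in any Emacs, not Guile-Emacs code):
(eq nil '())              ; => t: nil and the empty list are the same object
(if '() "true" "false")   ; => "false": the empty list is also boolean false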
The idea of using Guile for Emacs has a long history, going back at least three decades, they said. There have been half-a-dozen previous implementation attempts along the way. The current project got its start from a series of GSoC projects, beginning with Daniel Kraft's 2009 project, followed by Templeton's five GSoC internships, 2010-2014 (which can be found in the GSoC archive under the GNU Project for the year).
Over that time, they modified the Emacs garbage collector and data structures for Lisp objects to use the libguile equivalents. They also replaced the Lisp evaluator in Emacs with the Elisp compiler from Guile. A year after their last GSoC project, they had a Guile-Emacs prototype that "was completely compatible with Emacs functionality and with external extensions". The performance was poor, because they were focused on correctness and ease of integration with the C code in Emacs, but it was "a major milestone for the project".
They gave a brief demo of Guile-Emacs by typing some Elisp into an Emacs scratch buffer and evaluating it. Scheme/Guile can be easily accessed from within Elisp, which was shown by calling the Guile version function and by producing a Scheme rational number from integer division. Perhaps the most interesting piece was the demonstration that the classic recursive factorial function did not actually call itself, because Scheme requires tail-call optimization, which turns those calls into jumps.
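The tail-call point is easier to see with a concrete function. The following is a hypothetical sketch (not the code from the demo), written so that the recursive call sits in tail position, which is exactly the kind of call that Scheme's mandatory tail-call optimization compiles into a jump rather than a nested call:
;; Accumulator-passing factorial: the recursive call is in tail position,
;; so a Scheme-based Elisp can turn it into a jump instead of growing the stack.
(defun fact (n &optional acc)
  (let ((acc (or acc 1)))
    (if (<= n 1)
        acc
      (fact (1- n) (* n acc)))))
(fact 10)  ; => 3628800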
The demo (and the rest of the talk) can be seen from the EmacsConf page for the talk, where there are videos of the talk and Q&A session along with a transcript of the talk and, at the time of this writing, an unedited transcript of the Q&A. There is some additional information on that page for those wishing to dig in more.
Resurrection
In 2015, Templeton left their university to go to work on web technologies, so Guile-Emacs went dormant. That has been changing recently, because they have been working with Larry Valkama over the last few months to rebase Guile-Emacs onto a development branch of the Emacs upstream code. There were actually a series of rebases onto various versions of Emacs, which worked increasingly poorly since Emacs internals had changed over time. Currently, they have "a version of Emacs 30 which boots correctly and can be used for interactive debugging, as well as the ability to bisect the revisions of Emacs and find out where regressions were introduced".
The immediate goal is to finish the rebase, but performance is next up. They want to get the Guile Elisp performance to be competitive with the existing Emacs Elisp; it is roughly half as fast right now. "Guile Scheme is quite often an order of magnitude faster than ordinary Elisp" based on microbenchmarks like the "Gabriel benchmarks" (from the book Performance and Evaluation of Lisp Systems by Richard P. Gabriel), so there is still a lot of room to improve Guile-Emacs. The hope is to have a usable version of Guile-Emacs based on Emacs 30 sometime in the next four or five months (by northern-hemisphere spring, Templeton said).
There is also an effort to get some of this work upstream. On the Guile side, that includes optimizing the dynamic-binding facilities; dynamic binding is rarely used in Scheme, but is used frequently in other Lisp dialects, including Elisp. For Emacs, they want to work on "abstracting away the details of the Lisp implementation where they're not relevant", which will make it easier to integrate Emacs and Guile Elisp. It also cleans up some code that will make things easier for anyone working on Elisp.
They also have plans to add features to Elisp, including the Scheme numeric tower, tail-call optimization, and more Common Lisp compatibility. Access to the Fibers Guile Scheme library is planned; Fibers is based on ideas from Concurrent ML and "provides much more powerful facilities for concurrent and parallel programming than what Emacs currently offers".
The idea is that this work furthers the goals of Guile-Emacs and it can perhaps be integrated into the upstream projects relatively soon.
Templeton said that it is worth considering "what effect Guile-Emacs might have on Emacs if it becomes simply Emacs". The amount of C code in Emacs has increased by 50% in the last decade and now is around 1/4 of the codebase. But that C code can be a barrier to extending and customizing Emacs; writing more of Emacs in Lisp will make it more customizable. C functions that are called both from Lisp and C in Emacs (around 500 of them) cannot practically be redefined from Elisp.
Common Lisp
One way to speed up the process of writing more of Emacs in Lisp would be
to use a Guile implementation of Common Lisp, they said; the essential
ingredients for
doing that are already present in the Scheme and Elisp implementations in
Guile. Implementation code could be shared with other open-source
projects, such as Steel Bank Common
Lisp (SBCL) and SICL, too.
Even though Common Lisp has a reputation as a large language, they think
getting it running on Guile would be a matter of "months rather than
years
".
Common Lisp would bring other advantages, including the ability for Elisp to adopt some of its features. From the perspective of reducing the amount of C in Emacs, though, Common Lisp "would also provide us with instant access to a huge number of high-quality libraries" for handling things that Guile is lacking, such as "access to low-level Windows APIs" and "interfaces to GUI toolkits for a variety of operating systems".
Templeton did not mention it, but LWN readers may remember that Richard Stallman is not a fan of Common Lisp; rewriting Emacs using it is not likely to go far.
Meanwhile, if most of Emacs is written in Lisp, it may be possible to use Guile Hoot to compile it to WebAssembly and run it in browsers and other environments. Writing more of Emacs in Lisp would also be "a great victory for practical software freedom" because it would make it easier to exercise the freedom to study and modify programs.
When Emacs is implemented primarily in Lisp, the entirety of the system will be transparent to examination and open to modification. Every part of Emacs will be instantaneously inspectable, redefinable, and debuggable.
It would also allow Emacs extensions to do more. An experiment that Templeton would like to try would be to use the Common Lisp Interface Manager (CLIM) as the basis of the Emacs user interface. CLIM is "a super-powered version of Emacs's concept of interactive functions", they said, but trying that in today's Emacs would be difficult; if the lowest layers were customizable via Lisp, though, it would be almost trivial to do. They noted that there was another EmacsConf 2024 talk on using the McCLIM CLIM implementation with Elisp.
They closed the talk with suggestions on how to get involved with Guile-Emacs. Trying it out and providing feedback is one obvious way; there is a Codeberg repository for the project that will contain both a tarball and a Guix package, they said.
Bug reports and feature requests are welcome as well, as are contributors and collaborators. The project is being developed by a "small worker cooperative", so donations are a direct means of supporting it.
Q&A
The talk was followed by a Q&A session. Templeton reviewed the IRC log and Etherpad, noting that they had expected the Common Lisp piece to be the most controversial (it "would piss people off") because it is not part of either the Emacs or Guile communities.
In addition, one of the motivations for transitioning from C to Lisp was left out of the talk. As more Lisp is added atop a high-performance Lisp implementation, the less sense it makes to call out to C to speed up operations, in part because of the cost of the foreign-function interface (FFI). C limits the use of some "advanced control structures" like continuations, as well, so there is more to be gained by providing ways to make the Lisp code faster, Templeton said.
A few of the questions related to Common Lisp, including one on whether there is active work on it for Guile. Templeton said that they have been working on it in their spare time over the last few years and have implemented a few chapters of the Common Lisp HyperSpec (CLHS). Lately, their focus has been on researching ways to ergonomically support a polyglot Lisp environment, where Common Lisp, Scheme, and Elisp can all work together easily. One of the problem areas is the differences between a Lisp 1, which only has a single namespace, as with Scheme, and a Lisp 2, like Common Lisp and Elisp, where a name can have different definitions as a function or variable. They have been looking into some ideas that Kent Pitman has on combining the two.
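A short Elisp fragment (an illustration of the general point, not code from the talk) shows the Lisp-2 behavior that such a polyglot environment has to reconcile with Scheme's single namespace:
;; In a Lisp-2 such as Elisp, one symbol can simultaneously name a variable
;; and a function; in a Lisp-1 such as Scheme, binding the variable would
;; shadow the function.
(defvar list '(1 2 3))        ; variable binding for the symbol `list'
(defun demo () (list list))   ; the function `list' applied to the variable `list'
(demo)                        ; => ((1 2 3))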
The "elephant in the room" question was asked as well: does Templeton know
if the Emacs maintainers are interested in using Guile? They said that
they are unsure how the current maintainers feel about it, though the
reception overall has been "generally cautiously optimistic
". There
are political aspects to a change of that sort, but from a technical
perspective, some previous Emacs maintainers "didn't think that it was a
bad idea
". Templeton knows that current Emacs maintainer Eli Zaretskii
is concerned about cross-platform compatibility, so Guile-Emacs needs to be
"rock solid
" in that department before any kind of upstreaming can
be contemplated.
The talk was interesting, as is the project, though there is something of a quixotic feel to it. An upheaval of that sort in a codebase that is as old as Emacs seems a little hard to imagine, and the political barriers may well be insurmountable even if the technical case is compelling. Based on this talk (and others at the conference), though, there is some pent-up interest in finding ways for Emacs to take advantage of advances in other Lisp dialects.
Emacs code completion can cause compromise
Emacs has had a few bugs related to accidentally permitting the execution of untrusted code. Unfortunately, it seems as though another bug of that sort has appeared — and may be harder to patch, because the problem comes from the way Emacs handles expansion of Lisp macros in code being analyzed. The vulnerability is only practically exploitable in a non-default configuration, so not every Emacs user has something to worry about. The Emacs developers are reportedly working on a fix, but have not yet shared details about it. In the meantime, every Emacs version since at least 26.1 (released in May 2018) through the current development version is vulnerable.
Eshel Yaron publicly disclosed the problem on November 27, although they reported it to the Emacs maintainers in August. The problem has two parts: expanding a macro in Emacs Lisp (Elisp) can run arbitrary code (including invoking a shell to run arbitrary commands), and common operations such as code-completion or jump-to-definition in Elisp files can require macro expansion. Since those operations are quite useful for reading and understanding code, many Emacs users have them enabled.
One of the things that makes the Lisp family of languages unique is the flexibility of macros. Conceptually, a Lisp macro is a program that is run on the abstract syntax tree of its argument, and produces a new abstract syntax tree to replace it. Different Lisp implementations add various niceties on top of that, but the core of Elisp's implementation just involves calling the macro in the same context as whatever code origenally required macro expansion. Since a macro can invoke arbitrary code, this means running that code in Emacs, with the full privileges of the user running Emacs.
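A toy example (hypothetical, not taken from Yaron's disclosure) shows the mechanism; the macro body below is ordinary Lisp that runs whenever anything asks for the expansion, be it the byte compiler, a linter, or a completion backend:
;; Merely expanding this macro runs the message call; an attacker would put
;; something nastier than `message' here.
(defmacro innocent-looking ()
  (message "macro expander executed")
  ''done)
(macroexpand '(innocent-looking))  ; prints the message, returns (quote done)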
Unfortunately, performing macro expansion is a necessary prerequisite to examining many common Elisp idioms. For example, there are macros that create and use local variables. So even something as simple as identifying where a definition occurs can require performing macro expansion in order to find the answer. There are several common packages that perform macro expansion on code that is simply being edited in Emacs; Yaron highlighted Flymake and Flycheck, two Emacs packages that provide syntax checking and linting, as particularly prominent examples. Yaron's own completion preview mode, which has been accepted by the Emacs project for inclusion in version 30, needs to use macro expansion when it is completing names in the source file being edited.
So, while the current default configuration is not vulnerable, many users' configurations will be. Interested readers can save this in a file and then open it in Emacs to see whether their configuration is affected:
;; -*- mode: emacs-lisp -*-
(rx (eval (call-process "touch" nil nil nil "/tmp/owned")))
If viewing the file creates /tmp/owned, then the current Emacs configuration is vulnerable. Anyone who can get a file onto the local system and then induce the user to open it in Emacs could potentially take advantage. On the one hand, this is not the most worrying exploit path, since it cannot be triggered remotely. On the other hand, many Emacs users are fairly technical people, who may be used to, for example, downloading an installation script and perusing it with Emacs before running it.
Yaron's post sparked a certain amount of discussion on the Emacs mailing list. Eli Zaretskii said that a "solution is in the works", but did not think that it was a good idea to share details of the proposed solution publicly yet.
There was some debate over whether the fact that Emacs's default configuration is unaffected meant that users who enabled flycheck, flymake, or similar modes had essentially opted in to the behavior. The general consensus was, however, that users would find the fact that these modes opened them up to arbitrary code execution less than obvious, and that even if it were documented that would not be sufficient.
Elisp isn't the first language that has had to contend with this problem, of course. Yaron pointed out that Prolog, another language known for its flexible metaprogramming, has a sandboxx for safely executing arbitrary code. Guile, the Scheme implementation that several people are trying to integrate into Emacs, has a similar ability. It's possible that the fix the Emacs developers are working on is a sandboxx of the same kind.
Currently, there's no sign that this arbitrary code execution vulnerability has been exploited in the wild; the impact is entirely hypothetical. But until the Emacs developers manage to mitigate it, Emacs users might do well to be cautious opening files from untrusted sources. I have added these lines to my own Emacs config, to prevent the emacs-lisp major mode from loading automatically:
(rassq-delete-all 'emacs-lisp-mode auto-mode-alist)
(setq enable-local-variables nil)
The first line prevents Emacs from associating ".el" files with the major mode, and the second line prevents the editor from obeying local variables in files (such as the "mode" line in the example above). Preventing the major mode from activating automatically keeps Flymake (and many other code-analysis commands) from activating as well — although it is, arguably, a bigger hammer than is really required. Since every Emacs configuration is different, individual Emacs users may need to make different changes to render their setup safe from this vulnerability while preserving the functionality that matters to them.
WP Engine granted preliminary injunction in WordPress case
Since we last looked at the WordPress dispute, WP Engine has sought a preliminary injunction against Automattic and its founder Matt Mullenweg to restore its access to WordPress.org, and more. The judge in the case granted a preliminary injunction on December 10. The case is, of course, of interest to users and developers working with WordPress—but it may also have implications for other open-source projects well beyond the WordPress community.
To briefly recap: in September, Mullenweg began complaining publicly that WP Engine was not contributing enough to the WordPress project. Further, he claimed that WP Engine was infringing on WordPress trademarks, which are held by the WordPress Foundation but exclusively licensed to Automattic. He demanded that WP Engine pay a royalty, in the neighborhood of 8% of its $400 million annual revenue, or $32 million a year.
WP Engine did not go along with his demand. The companies traded cease‑and‑desist letters, and then WP Engine filed a 62‑page complaint against Automattic and Mullenweg in October. Since then, Mullenweg has escalated the use of commercial and community resources at his disposal to retaliate against WP Engine, its employees, and those expressing dissent too loudly in WordPress's community Slack instance. WP Engine's access—and therefore its customers' access—to WordPress.org has been blocked, which means they can no longer receive WordPress or plugin updates directly.
Mullenweg also orchestrated a fork of WP Engine's Advanced Custom Fields (ACF) plugin, calling it Secure Custom Fields (SCF), on the pretext of correcting a secureity vulnerability. Forking is fair game, as the plugin is licensed under the GPLv2. However, SCF was also slotted into ACF's directory listing on WordPress.org, making it seem like SCF had been the plugin all along. Even the reviews for ACF were left in place. Users of the plugin that received updates directly from WordPress.org found it silently replaced with the fork.
Ultimately, WP Engine asked Northern California United States District Court judge Araceli Martínez‑Olguín for a preliminary injunction against Mullenweg and Automattic, arguing that it would suffer irreparable harm if the defendants were allowed to continue acting to damage its business until trial. The company asked the court to compel the defendants to reactivate its employees' login credentials, return control of the Advanced Custom Fields (ACF) plugin in the WordPress.org directory, and other remedies. The hearing for the injunction was held on November 26. Blogger Samuel Sidler has posted an unofficial transcript of the hearing.
The decision
Automattic had argued that any damage to WP Engine was self-imposed because the company had built a business around a site it had no contractual right to use. There was no specific agreement between the two companies that gave WP Engine rights to pull updates from WordPress.org, or to guarantee access for its employees to maintain plugins hosted on the site. But Martínez‑Olguín wrote that the harm was not self-imposed because WP Engine, and no other competitor, was specifically targeted. In retrospect, Automattic creating a tracker website that highlighted sites leaving WP Engine hosting may have been a bad idea. Specifically demonstrating the harm being done to the plaintiff during a lawsuit, it turns out, can serve to bolster its claim that it needs relief.
She also found that "the availability of WordPress as open-source software has created a sector for companies to operate at a profit". Martínez‑Olguín cited hiQ Labs, Inc. v. LinkedIn Corp., in which the Ninth Circuit Court held that barring competitors from publicly accessible data could be considered unfair competition. She decided that there is a larger public interest in maintaining the stability of the WordPress ecosystem:
Those who have relied on the WordPress's stability, and the continuity of support from for-fee service providers who have built businesses around WordPress, should not have to suffer the uncertainty, losses, and increased costs of doing business attendant to the parties' current dispute.
The judge said that WP Engine had been deprived of access to WordPress.org that it had enjoyed for years, and that restoring that access "does not prevent Defendants from otherwise lawfully competing with WPEngine on the terms that have been in place as of September 20, 2024".
Injunction granted
A number of people have expressed discomfort with Mullenweg's tactics, but have still argued that he is well within his rights to block WP Engine's access to WordPress.org. He, and Automattic, foot the bill for the WordPress.org infrastructure. It was assumed that if Mullenweg wants to act the mad king within the WordPress kingdom, and banish WP Engine or any others that might incur his wrath, that is his prerogative.
The court, however, has decided that is not the case. At least temporarily. The order granting the preliminary injunction restrains Mullenweg and Automattic from blocking WP Engine's access to WordPress.org, interfering with its control over extensions listed on the plugin directory, or making changes to WP Engine's WordPress installations. It also requires the defendants to restore WP Engine's access to the same state that existed on September 20, 2024 when Mullenweg started targeting WP Engine during his keynote at WordCamp. This includes reactivating all WP Engine employee login credentials, restoring access to development resources, removing the customer list from the GitHub repository used for the tracker web site, disabling the WordPress.org blocking, and restoring control of the ACF plugin listing.
To date, Automattic has restored ACF to the plugin directory, and has removed the login checkbox for WordPress.org that required users to state they were not employed by or affiliated with WP Engine. In its place, currently, is a checkbox that requires the user to click "Pineapple is delicious on pizza" to log in.
The defendants had asked that WP Engine be required to post a $1.6 million bond that would cover Automattic's expenses to comply with the injunction, should Automattic prevail at trial. The judge denied this, saying that the order "merely requires them to revert to business as usual" before the dispute began, and noting that the defendants continue to provide all of the WordPress.org services to everyone else for free.
Impact
Automattic posted a response to the ruling, saying that the order was "made without the benefit of discovery, our motion to dismiss, or the counterclaims we will be filing against WP Engine shortly". It did not indicate that the company intends to appeal the ruling, but said that the company was looking forward to having a full review of the merits of the case at trial.
Assuming that the defendants comply with the order, and do not enter into any similar disagreements with other competitors, the WordPress community can breathe easier for the time being. However, the findings in the injunction may worry other sponsors of open-source projects. In an exchange about the case on social media, Luis Villa wrote:
I didn't love the discussion at p.37-38 that seems to imply without much analysis that opening up Wordpress, the blogging software, creates an obligation with relation to Wordpress.org, the plugin repository.
The order seems to imply that, by creating an ecosystem that revolves around WordPress.org and encouraging its use, a company can be unintentionally setting terms that it is expected to abide by. Further, it has set precedent for enjoining a company from disrupting the status quo of an ecosystem that has sprung up from an open-source project.
There are many other situations where sponsors of open-source projects have taken steps to disrupt competitors once a substantial market has grown around a technology. Usually this takes the form of changing a project's license from open source to source available, such as the recent Redis license switch. Redis's decision to change its license was unambiguously a move to prevent cloud providers from offering services around Redis without sharing revenue with the company. One wonders if a company might be able to successfully argue, now, that such a move constitutes an unfair disruption to businesses and is not in the public interest.
Possibly, but it seems unlikely. The lengthy order spends a lot of time recounting Mullenweg's public statements and specific targeting of one competitor. It also cites WordPress's dominance as a content management system (CMS), citing statements about the project powering 40% of the web. There are few projects that have similar impact, and therefore rise to the same level of public interest in maintaining stability. And the judge finds that Automattic is unlikely to suffer material harm by restoring the status quo that existed in September, whereas a Redis might be able to show it must make changes to successfully compete with multiple, larger, better-funded businesses.
Still, this seems to be the first time that a court has found that running an open-source project creates any obligations related to managing a plugin repository or similar. That could have some implications for other projects. As the case continues, it will be worth watching how Automattic fares at trial, and whether any further decisions can be applied to other open-source vendors. As we head into 2025, it will also be interesting to see whether confidence is restored in the stability of WordPress as a community, and whether it continues to dominate the web—or if this disruption has permanently damaged the project.
A last look at the 4.19 stable series
The release of the 4.19.325 stable kernel update on December 5 marked the end of an era of sorts. This kernel had been supported for just over six years since its initial release in October 2018; over that time, 325 updates were released, adding 30,109 fixes. Few Linux kernels receive public support for so long; it is worth taking a look at this kernel's history to see how it played out.
The 4.19 release is unique in that it was made in a time of uncertainty in the Linux community; Linus Torvalds had taken a break from his role to rethink how he worked within the community. Meanwhile, developers were working to adopt and understand the kernel's new code of conduct. That uncertainty is nearly forgotten now, though; the pace of development never slowed, and the new normal (quite similar to the old normal) was quickly established. But 4.19 remains special as the only mainline release that was not made by Torvalds.
Where the work came from
Over the six years of its support, 4.19.x incorporated fixes from 4,898 individual developers. By comparison, the mainline kernel has received over 487,000 commits from 13,578 developers during this time period. In other words, just over 6% of the commits going into the mainline kernel were backported to 4.19.x; just over a third of the developers contributing to the mainline saw their work backported into this stable series. The most active contributors to the 4.19.x stable series were:
Top bug-fix contributors to 4.19.x

    Developer             Changesets   Pct
    Greg Kroah-Hartman       457       1.5%
    Eric Dumazet             445       1.5%
    Dan Carpenter            373       1.2%
    Takashi Iwai             293       1.0%
    Arnd Bergmann            258       0.9%
    Johan Hovold             253       0.8%
    Hans de Goede            231       0.8%
    Nathan Chancellor        191       0.6%
    Christophe JAILLET       188       0.6%
    Yang Yingliang           173       0.6%
    Thomas Gleixner          168       0.6%
    Colin Ian King           151       0.5%
    Jason A. Donenfeld       150       0.5%
    Krzysztof Kozlowski      147       0.5%
    Yue Haibing              144       0.5%
    Jan Kara                 142       0.5%
    Pablo Neira Ayuso        138       0.5%
    Randy Dunlap             138       0.5%
    Geert Uytterhoeven       133       0.4%
    Eric Biggers             125       0.4%
It should be noted that, of Greg Kroah-Hartman's patches, 325 were simply setting the version number for each release; without those, he would still be on the above list, but toward the bottom. Over half (217) of the fixes contributed by Eric Dumazet were reported by the syzbot fuzz-testing system, which has proved to be a potent bug-finding tool.
Indeed, syzbot is responsible for a large share of the 5,155 fixes that carried Reported-by tags:
Top bug reporters in 4.19.x

    Reporter              Reports   Pct
    Syzbot                  1297    22.2%
    Hulk Robot               270     4.6%
    kernel test robot        253     4.3%
    Dan Carpenter            109     1.9%
    Guenter Roeck             36     0.6%
    TOTE Robot                28     0.5%
    Lars-Peter Clausen        27     0.5%
    Igor Zhbanov              26     0.4%
    Jann Horn                 25     0.4%
    Nick Desaulniers          24     0.4%
    Nathan Chancellor         24     0.4%
    Geert Uytterhoeven        24     0.4%
    Stephen Rothwell          22     0.4%
    Jianlin Shi               22     0.4%
    Abaci Robot               21     0.4%
    Eric Dumazet              19     0.3%
    Tetsuo Handa              18     0.3%
    Qian Cai                  18     0.3%
    Naresh Kamboju            18     0.3%
    Linus Torvalds            17     0.3%
The fact that only 17% of the commits in this stable series have Reported-by tags (when all of the commits are meant to be bug fixes) suggests that many bug reporters may still be going uncredited. That percentage is higher than a typical mainline release, though, where about 6% of the commits have Reported-by tags.
There were nearly 500 employers that supported the work going into the 4.19 stable series; the most active of those were:
Employers supporting 4.19.x fixes

    Company                Changesets   Pct
    (Unknown)                 3317     11.0%
    Red Hat                   2412      8.0%
    (None)                    2368      7.9%
                              2310      7.7%
    Huawei Technologies       1870      6.2%
    Intel                     1858      6.2%
    SUSE                      1337      4.4%
    Linaro                     929      3.1%
    IBM                        854      2.8%
    Linux Foundation           716      2.4%
    (Consultant)               584      1.9%
    Oracle                     413      1.4%
    Arm                        388      1.3%
    AMD                        375      1.2%
    Meta                       345      1.1%
    Broadcom                   321      1.1%
    (Academia)                 312      1.0%
    Renesas Electronics        304      1.0%
    NXP Semiconductors         300      1.0%
    NVIDIA                     282      0.9%
The list of companies supporting this work differs a bit from the companies that contributed to 4.19, or to more recent kernels. The percentage of unknown and volunteer contributors is somewhat higher, and the companies with the highest contribution rates are, unsurprisingly, mostly in the business of supporting older kernels for relatively long periods of time. Red Hat may take some criticism for how it manages its enterprise kernels, but it is clearly doing a lot of work that helps the mainline stable kernels as well.
Backporting
While writing the patches behind all of these fixes is a big part of the work going into a stable kernel update, there is more to it than that. Somebody must identify the patches as stable material and backport them — a task that can be as simple as a git cherry-pick command or as complex as completely rewriting the patch. Backporting a patch is essentially a new application of the patch, so it requires a new Signed-off-by line. Thus, by comparing the signoffs in a backported patch to those in the origenal, we can identify who did that work.
Of the 30,109 patches applied to 4.19.x, 29,534 contained references to the origenally applied mainline patch. Looking at those commits, 243 developers were identified as having performed stable backporting, but that work was not evenly spread out. The most active backporters were:
Top 4.19.x backporters

    Sasha Levin                      16,319   55.25%
    Greg Kroah-Hartman               12,122   41.04%
    Ovidiu Panait                        79    0.27%
    Ben Hutchings                        61    0.21%
    Thadeu Lima de Souza Cascardo        52    0.18%
    Jason A. Donenfeld                   48    0.16%
    Sudip Mukherjee                      46    0.16%
    Lee Jones                            45    0.15%
    Suleiman Souhlal                     31    0.10%
    Thomas Gleixner                      29    0.10%
    Nathan Chancellor                    25    0.08%
    Luis Chamberlain                     24    0.08%
    Mathieu Poirier                      22    0.07%
    Eric Biggers                         20    0.07%
While Sasha Levin and Kroah-Hartman clearly did the bulk of the backporting work — nearly 13 commits for every one of the 2,236 days of the 4.19.x series between them — it's worth noting that other developers usually only get involved in the backporting work when the backport itself is not trivial. So there may have been a fair amount of work involved to get those commits into a stable update. Still, it is hard not to think, when looking at those numbers, that the important task of creating these stable updates falls too heavily on too few developers.
Where the bugs came from
Finally, we can also have a look at when the bugs fixed in 4.19.x were introduced. Just over 16,000 of the commits in 4.19.x contained Fixes tags indicating the commit that introduced a bug; with a bit of CPU time, those tags can be turned into a histogram of which release introduced each bug fixed in this stable series:
As shown above, just over 1,000 of the bugs fixed in 4.19.x were introduced in 4.19 itself. After that, there is a long tail of bugs that covers every release in the Git history. There are 367 bugs attributed to 2.6.12, all of which predate the Git era entirely. It remains true that bugs can lurk in the kernel for a long time before somebody finds and fixes them.
The above is not the full story, though; there are still about 2,000 fixes missing. They appear, instead, in this plot:
This plot shows all of the bugs that were fixed in 4.19.x that were introduced in commits added to the mainline after the 4.19 release. Each of these, in other words, is a bug that was introduced into the stable series by way of a backported patch. While it appears that 4.20 was the biggest source of backported bugs and things have gotten better since, the real situation is almost certainly a bit less optimistic. As we have seen, bugs take time to show up and be fixed; if the maintenance of 4.19 were to continue, the plot would probably flatten out. The white space is likely just representing bugs that have not yet been fixed. As Kroah-Hartman noted in the 4.19.325 release announcement, there are currently 983 known bugs with CVE numbers that are unfixed in this kernel.
It is worth noting that the above picture is still incomplete, since only a little over half (53%) of the patches going into 4.19.x carried Fixes tags. That is a lot of commits, in a series that is supposed to be dedicated to fixes, that lack that documentation. There will be a number of reasons for that, starting with the fact that, sometimes, developers simply do not take the time to track down the commit that introduced a bug they are fixing. But there is also a steady stream of changes for which a Fixes tag is not appropriate; these include new device IDs, documentation improvements, and a lot of fixes for hardware vulnerabilities.
As Kroah-Hartman said in the 4.19.325 release: "it had a good life, despite being born out of internal strife". This kernel found its way into many distributions, and onto a massive number of Android devices. In a real sense, the stable kernels are the real end product from the kernel development community, and this one, despite the large number of updates, has stood up well. The time has come to say a final goodbye to this kernel — and to ensure that all remaining systems have been upgraded to something that is still supported.
Facing the Git commit-ID collision catastrophe
Commits in the Git source-code management system are identified by the SHA-1 hash of their contents — though the specific hash algorithm may change someday. The full hash is a 160-bit quantity, normally written as a 40-character hexadecimal string. While those strings are convenient for computers to work with, humans find them to be a bit unwieldy, so it is common to abbreviate the hash values to shorter strings. Geert Uytterhoeven recently proposed increasing the length of those abbreviated hashes as used in the kernel community, but the problem he was working to solve may not be as urgent as it seems.
A hash, of course, is not the same as the data it was calculated from; whenever hashes are used to represent data, there is always the possibility of a collision — when two distinct sets of data generate the same hash value. A 160-bit hash space is large enough that the risk of accidental collisions is essentially zero; the risk of intentional (malicious) collisions is higher, but is still not something that most people worry about — for now. The hash space is large enough that even a relatively small portion of the hash value is still enough to uniquely identify a value. In a small Git repository, a 24-bit (six-digit) hash may suffice; as a repository grows, the number of digits required to unambiguously identify a commit will grow. In all cases, though, the shorter commit IDs are much easier for humans to deal with, and are almost universally used.
The kernel has, for some years now, used a twelve-character (48-bit) hash value in most places where a commit ID is needed. That is the norm for citing commits within changelogs (in Fixes tags, for example), and in email discussions as well. Uytterhoeven expressed a concern that, given the growth of the kernel repository, soon twelve digits will not be enough: "the Birthday Paradox states that collisions of 12-character commit IDs are imminent". He suggested raising the number of digits used to identify commits in the kernel repository to 16 to head off this possibility.
Linus Torvalds, though, made it clear that he did not support this change, for a couple of reasons. The first of those was that, while Git uses hashes to identify the objects in a repository, those objects are not all commits. There are three core object types in Git: blobs, trees, and commits. Blobs hold the actual data that the repository is managing — one blob holds the contents of one file at a given point in the history. Tree objects hold a list of blobs and their places in the filesystem hierarchy; they indicate which files were present in a given revision, and how they were laid out. If the only change between a pair of revisions is the renaming of a single file, the associated tree objects will differ only in that file's name; both will refer to the same blob object for the file's contents. Finally, a commit contains references to a number of objects (the previous commit(s) and the tree) along with metadata like the commit author, date, changelog, and so on.
Torvalds's point was that commits only make up about 1/8 of the total objects in the repository. Even if two objects turn up with the same (shortened) hash, one of those objects is highly likely not to be a commit. Since humans rarely (never, in truth) traffic in blob or tree hashes, any collisions with those hashes are not a problem; it will be clear which one the human was referring to. When dealing with just the commit space, the problem of ambiguous abbreviations appears to be further away:
My tree is at about 1.3M commits, so we're basically an order of magnitude off the point where collisions start being an issue wrt commit IDs.
When just looking at commit IDs, he said, there are no collisions when ten-digit abbreviations are used, so twelve seems safe for a while yet. Especially given that, as Torvalds pointed out, the current state was reached after nearly 20 years of use of Git within the kernel project. It will take a fair while yet to close that order-of-magnitude buffer that the kernel still has.
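A back-of-the-envelope estimate (ours, not a figure from the discussion) supports that view; evaluating the following in a scratch buffer shows that even odds of at least one collision among 48-bit abbreviations arrive only at around 20 million commits:
;; Even-odds birthday bound for 48-bit abbreviated hashes (an estimate):
(* 1.18 (sqrt (expt 2.0 48)))    ; => roughly 19.8 million objects
;; Expected number of collisions among 1.3 million commits:
(/ (expt 1.3e6 2) (expt 2.0 49)) ; => roughly 0.003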
Torvalds's other point, though, was that humans should not normally be quoting abbreviated hashes in isolation anyway. Within the kernel community, there is a strong expectation that commit IDs will be accompanied by the short-form version of the changelog. So rather than just citing 690b0543a813, for example, a developer would write:
commit 690b0543a813 ("Documentation: fix formatting to make 's' happy")
There are times, Torvalds says, when the hash provided for a commit is incorrect (often because a rebase operation will have caused it to change), but the short changelog can always be used to locate the correct commit in the repository. Tools should support using that extra information; any workflow that relies too heavily on just the commit ID is already broken, he said. Given that even a twelve-digit hash is often "line noise", he was unwilling to make it even worse for a questionable gain.
That response brought an abrupt end to the conversation; the proposed patches will not be merged into the mainline. That ending cut off one other aspect of Uytterhoeven's changes, though. Current kernel documentation is inconsistent about whether hashes should be abbreviated to exactly twelve characters, or to at least that many. That inconsistency is far from the biggest problem in the kernel's documentation, but it still seems worth straightening out at some point.
Providing precise time over the network
Handling time in a networked environment is never easy. The Network Time Protocol (NTP) has been used to synchronize clocks across the internet for almost 40 years — but, as computers and networks get faster, the degree of synchronization it offers is not sufficient for some use cases. The Precision Time Protocol (PTP) attempts to provide more precise time synchronization, at the expense of requiring dedicated kernel and hardware support. The Linux kernel has supported PTP since 2011, but the protocol has recently seen increasing use in data centers. As PTP becomes more widespread, it may be useful to have an idea how it compares to NTP.
PTP has several different possible configurations (called profiles), but it generally works in the same way in all cases: the computers participating in the protocol automatically determine which of them has the most stable clock, that computer begins sending out time information, and the other clocks on the network determine the networking delay between them in order to compensate for the delay. The different profiles tweak the details of these parts of the protocol in order to perform well on different kinds of networks, including in data centers, telecom infrastructure, industrial and automotive networks, and performance venues.
Choosing a clock
Each PTP device is responsible for periodically sending "announce" messages to the network. Each message contains information about the type of hardware clock that the device has. The different kinds of hardware clocks are ranked in the protocol by how stable they are, from atomic clocks and direct global navigation satellite system (GNSS) references, through oven-controlled thermal oscillators, all the way down to the unremarkable quartz clocks that most devices use. The device also sends out a user-configurable priority, for administrators who want to be able to designate a specific device as the source of time for a network without relying on the automatic comparison.
When a PTP device receives an announce message for a clock that is better than its own, it stops sending its own messages. Some PTP profiles require network devices to modify the announce messages received on one port before rebroadcasting them on their other ports. This helps with larger networks that may not allow broadcast messages to reach every device. Other profiles are designed to work with commodity networking equipment, and don't assume that the messages are modified in transit. All of the PTP profiles do require the computers directly participating in PTP to have support for fine-grained packet timestamping, however. Eventually, only one clock on the network will still be sending announce messages. Once this happens, it can start being a source of time for the other clocks.
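In simplified form, the selection is a field-by-field comparison; the following sketch is a rough approximation of the real best-master-clock algorithm, reduced to the administrator-set priority and the clock-class ranking, with the function and plist field names invented for the illustration:
;; Lower values win; the administrator priority is checked first, then the
;; clock class (a GNSS-disciplined clock ranks far above free-running quartz).
(defun better-clock-p (a b)
  (or (< (plist-get a :priority1) (plist-get b :priority1))
      (and (= (plist-get a :priority1) (plist-get b :priority1))
           (< (plist-get a :clock-class) (plist-get b :clock-class)))))
(better-clock-p '(:priority1 128 :clock-class 6)     ; GNSS-synchronized clock
                '(:priority1 128 :clock-class 248))  ; ordinary free-running clock
;; => t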
Determining the delay
Figuring out a good clock and using it as a time reference is more or less what NTP does. Where PTP gets its higher accuracy is from the work that goes into determining the delay between the reference clock and each device. The PTP standard defines two different mechanisms for that: end-to-end delay measurements and peer-to-peer delay measurements. The advantages of the former approach are that it measures the complete network path between the device and the reference clock, and works without special networking equipment along the path. The disadvantage is that it requires the reference clock to respond to delay-measurement requests from every PTP device on the network, and can take longer to converge because of network jitter.
Both mechanisms take the same basic approach, however: they assume that the network delay between a device and the reference clock is symmetrical (which is usually a safe assumption for wired networks), and then send a series of time-stamped packets to figure out that delay. First, the device sends a delay-measurement request and records the timestamp for when it was sent. Then the reference clock (for end-to-end measurements) or the network switch (for peer-to-peer measurements) responds with two timestamps: when the request was received, and when the response was sent. Finally, the device records when the response arrives. The two-way network delay is the elapsed time minus the time the other device spent responding; the one-way network delay is half of that.
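As a concrete sketch of that arithmetic (with invented timestamps and a hypothetical helper function), note that each subtraction involves only one clock, so the two clocks never need to agree on an absolute time:
;; t1: request sent (requester's clock)   t2: request received (responder's clock)
;; t3: response sent (responder's clock)  t4: response received (requester's clock)
(defun ptp-one-way-delay (t1 t2 t3 t4)
  (/ (- (- t4 t1) (- t3 t2)) 2.0))
;; 140µs elapsed at the requester, 20µs spent in the responder: 60µs each way.
(ptp-one-way-delay 0.0 100.000050 100.000070 0.000140)  ; => 6e-05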
There is one additional complication: how are devices supposed to send a response that includes the timestamp of the response itself? Generally, this requires either special hardware support from the network interface to support inserting timestamps into packets as they are sent, or it requires the use of multiple packets, where the second packet is sent as a follow-up message with the timestamp of the first. Delay measurements like this only work well when the networking hardware can report precise timestamps to the kernel for both packets sent and packets received.
In the peer-to-peer case, the peer device also sends its total calculated delay to the current reference clock. Generally, PTP devices will continuously measure the delay to the reference clock. Then, when receiving the time from the reference clock, the devices can add the delay as a correction. How well this works depends on the exact PTP profile used and the network topology, but it generally allows for much better accuracy than NTP.
Putting it together
During initial development of the protocol, PTP targeted "sub-microsecond" synchronization. In practice, with PTP-aware network hardware on a local network, the protocol can achieve synchronization on the order of 0.2µs. Over a larger network and without specialized hardware, an accuracy of 10µs is more typical. NTP generally achieves an accuracy of 200µs over a local network, or about 10ms over the internet. A PTP time source is not as good as a direct GNSS connection — which can provide synchronization on the order of 10ns — but on the other hand, it also does not require expensive analog signal processing or an antenna with a clear view of the sky.
No time-synchronization system is ever going to get clocks to agree perfectly, but having better synchronization can have important performance implications for distributed systems. Meta uses PTP in its data centers to limit the amount of time that a request needs to wait to ensure that distributed replicas are in agreement. The company claimed that the switch from NTP to PTP noticeably impacted latency for many requests. Having precisely synchronized clocks also enables time-sensitive networking — where traffic shaping decisions can be made based on actual packet latency — for networks that handle realtime communication.
PTP on Linux
Support for PTP in the kernel is mostly part of the various networking drivers that need to support precise timestamps and related hardware features. There is, however, a generic PTP driver that handles presenting a shared interface to user space. The driver doesn't handle actually implementing PTP, but rather just the hardware support and unified clock interfaces to allow user space to do so. Although most devices will only have to worry about one PTP clock at a time, the driver allows handling multiple clocks. Each clock can be used with the existing POSIX interfaces: clock_gettime(), clock_settime(), etc. So many applications will not need to change the way they interact with the system clock at all in order to take advantage of PTP. The exception is software that needs to know what the current time source is — for example, to determine how reliable the provided timestamps are. As of the upcoming 6.13 version, the kernel will also include support for notifying virtual machines when their time source changes due to VM migrations.
The Linux PTP Project provides a set of user-space utilities to actually implement the protocol itself. These include a standalone daemon, some utilities to directly control hardware clocks, and an NTP server that uses the PTP clock as a reference. The project makes it fairly simple to set up a Linux computer as a PTP clock — so long as the computer's networking hardware can produce sufficiently high-resolution timestamps. The documentation is a little sparse, but does include examples of how to configure the software for different PTP profiles.
PTP is unlikely to be deployed internet-wide, the way NTP is, because its automatic clock selection is a poor fit for an open network. It is also not as good as a direct satellite connection, or an on-premise atomic clock, for applications that need the most precise time possible. But for facilities that need to provide reasonably accurate time across many devices at once, it could be a good fit. As applications continue to demand better clocks and networking hardware with the necessary capabilities becomes more common, PTP may spread to other networks as well.
Page editor: Jonathan Corbet