
Kernel development

Brief items

Kernel release status

The current development kernel is 4.11-rc4, released on March 26. Linus said: "So on the whole things look fine. There's changes all over, and in mostly the usual proportions. Some core kernel code shows up in the diffstat slightly more than it usually does - we had an audit fix and a bpf hashmap fix, but on the whole it all looks very regular".

Stable updates: 4.10.6, 4.9.18, and 4.4.57 were released on March 27.

Comments (none posted)

Quotes of the week

I love time-traveling maintainers! They are very tolerant of people who don't double-check -next first.
Kees Cook

Most IOT targets are so small that people are rewriting new operating systems from scratch for them. Lots of fragmentation already exists. We're talking about systems with less than one megabyte of RAM, sometimes much less. Still, those things are being connected to the internet. And this is going to be a total security nightmare.

I wish to be able to leverage the Linux ecosystem for as much of the IOT space as possible to avoid the worst of those nightmares.

Nicolas Pitre

Code should make sense, otherwise it's not going to be maintainable. Naming matters. If the code doesn't match the name of the function, that's a bug regardless of whether it has semantic effects or not in the end - because somebody will eventually depend on the _expected_ semantics.
Linus Torvalds

Comments (none posted)

Eudyptula Challenge Status report

The Eudyptula Challenge is a series of programming exercises for the Linux kernel. It starts with a very basic "Hello world" kernel module and moves up in complexity to getting patches accepted into the mainline kernel. The challenge will be closed to new participants in a few months, when 20,000 people have signed up. LWN covered the Eudyptula Challenge in May 2014, when it was fairly new. As of this writing, over 19,000 people have signed up, but only 149 have finished.
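
For the curious, that first task calls for something on the order of the classic hello-world module; a minimal sketch (not the official task text) looks like this:

    /*
     * hello.c: a minimal "Hello world" module of the sort the challenge's
     * first task asks for (a sketch, not the official task text). Build it
     * against the kernel headers with an obj-m Makefile, then insmod it.
     */
    #include <linux/init.h>
    #include <linux/module.h>

    static int __init hello_init(void)
    {
        printk(KERN_DEBUG "Hello World!\n");
        return 0;               /* nonzero here would fail the module load */
    }

    static void __exit hello_exit(void)
    {
        printk(KERN_DEBUG "Goodbye\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

    MODULE_LICENSE("GPL");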

Full Story (comments: 22)

Kernel podcast for March 28

The March 28 kernel podcast is out. "In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs."

Comments (none posted)

Kernel development news

Sharing pages between mappings

By Jonathan Corbet
March 26, 2017

LSFMM 2017
In the memory-management subsystem, the term "mapping" refers to the connection between pages in memory and their backing store — the file that represents them on disk. One of the fundamental assumptions in the kernel is that a given page in the page cache belongs to exactly one mapping. But, as Miklos Szeredi explained in a plenary session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, there are situations where it would be desirable to associate the same page with multiple mappings. Achieving this goal may not be easy, though.
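
That one-to-one assumption is visible directly in the kernel's struct page; the following is a simplified excerpt (abridged from <linux/mm_types.h>, with many fields omitted):

    /* Simplified sketch of struct page: each page cache page points
     * back to exactly one mapping. */
    struct page {
        unsigned long flags;
        struct address_space *mapping;  /* the one mapping this page belongs to */
        pgoff_t index;                  /* the page's offset within that mapping */
        /* ... many other fields omitted ... */
    };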

Szeredi is working with the overlayfs filesystem, which works by stacking a virtual filesystem on top of another filesystem to provide a modified view of that lower filesystem. When pages from the real file in the lower filesystem are read, they show up in the page cache. When the upper filesystem is accessed, the virtual file at that level is a separate mapping, so the same pages show up a second time in the page cache. The same sort of problem can arise in a single copy-on-write (COW) filesystem like Btrfs; different files can share the same data on disk, but that data is duplicated in the page cache. At best, the result of this duplication is wasted memory.

Kirill Shutemov noted that anonymous memory (program data that does not have a file behind it) has similar semantics; a page can appear in many different address spaces. For anonymous pages, the anon_vma mechanism allows the kernel to keep track of everything and provides proper COW semantics. Perhaps something similar could be done with file-backed pages.

James Bottomley said that the important questions were how much it would cost to maintain these complicated mappings, and how coherence would be maintained. He pointed out that pages could be shared, go out of sharing for a while, then become shared again. Perhaps, he said, the KSM mechanism could be used to keep things in order. Szeredi said he hadn't really thought about all of those issues yet.

On the question of cost, Josef Bacik said that his group had tried to implement this sort of feature and found it to be "insane". There are a huge number of places in the code that would need to be audited for correct behavior. There would be a lot of real-world benefits, he said, but he decided that it simply wasn't worth it.

Matthew Wilcox suggested a scheme where there would be a master inode on each filesystem with other inodes sharing pages linked off of it. But Al Viro responded that this approach has its own challenges, since the inodes involved do not all have to be on the same filesystem. Given that, he asked, where would this master inode be? Bacik agreed, saying that he had limited his exploration to single-filesystem sharing; things get "even more bonkers" if multiple filesystems are involved. If this is going to be done at all, he said, it should be done on a single filesystem first.

Bottomley said that the problems come from attempting to manage the sharing at the page level. If it were done at the inode level instead, things would be easier. Viro said that inodes can actually share data now, but it's an all-or-nothing deal; there is no way to share only a subset of pages. At that level, this functionality has worked for the last 15 years. But, since the entire file must be shared, Szeredi pointed out, the scheme falls down if the sharing must be broken at some point — if the file is written, for example. Viro suggested trying to steal all of the pages when that happens, but Szeredi said that memory mappings would still point to the shared pages.

Bottomley then suggested stepping back and considering the use cases for this feature. Users with lots of containers, he said, want to transparently share a lot of the same files between those containers; this sort of feature would be useful in such settings. Bacik added that doing this sharing at the inode level would lose a lot of flexibility, but it might be enough for the container case which, he said, might be the most important case. Jan Kara suggested simply breaking the sharing when a file is opened for write, or even requiring that users explicitly request sharing, but Bottomley responded that container users would not do that.

The conclusion from the discussion is that per-inode sharing of pages between mappings is probably possible, if somebody were sufficiently motivated to implement it. Per-page sharing, instead, was widely agreed to be insane.

Comments (9 posted)

The future of DAX

By Jonathan Corbet
March 27, 2017

LSFMM 2017
DAX is the mechanism that enables direct access to files stored in persistent memory arrays without the need to copy the data through the page cache. At the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Ross Zwisler led a plenary session on the future of DAX. Development in this area offers a number of interesting trade-offs between data safety and enabling the highest performance.

The biggest issue for next year, Zwisler said, is finding the best way to handle flushing of data from user space. Data written to persistent memory by the CPU may look like it is permanently stored but, most likely, it has only made it as far as the cache; that data can still be lost in the event of a crash, power failure, or asteroid strike. For pages in the page cache, user space can use msync() to flush the data to persistent storage, but DAX pages explicitly avoid the page cache. So flushing data to permanent storage requires going through the radix tree, finding the dirty pages, and flushing the associated cache lines. Intel provides some instructions for performing this flushing quickly; the kernel will use those instructions to ensure that data is durable after an msync() call. So far, so good.

The problem is that there are use cases where msync() is too slow, so users want to avoid it. Instead, they would like to write and flush individual chunks of data themselves without calling into the kernel. This method can be quite a bit faster, since the application knows which data it has written, while the kernel lacks the information to flush data at the individual cache-line level.
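
As an illustration of the user-space approach (a sketch, not code shown in the session; it assumes an Intel CPU with the clwb instruction, a buffer mapped from a DAX file, and compilation with -mclwb):

    /* Sketch: flush a just-written range of persistent memory from user
     * space, without a system call. Assumes dst points into an mmap()ed
     * DAX file and the CPU supports clwb (compile with -mclwb). */
    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE 64

    static void pmem_write(void *dst, const void *src, size_t len)
    {
        uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end = (uintptr_t)dst + len;

        memcpy(dst, src, len);
        for (; p < end; p += CACHELINE)
            _mm_clwb((void *)p);        /* write this cache line back */
        _mm_sfence();                   /* order the flushes before later stores */
    }

Note that this makes the written data itself durable but, as described next, says nothing about any filesystem metadata behind the file.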

This technique works as long as no file-data allocations have been done in the write path. Otherwise, there will be changed filesystem metadata that also needs to be flushed, and that will not happen in this scenario. As a result, data can be lost in a crash. A number of solutions to this problem have been proposed, but, according to Zwisler, Dave Chinner has called them all "crazy". A safer approach, Chinner said last September, is to simply require that files be completely preallocated before writing begins; at that point, there should be no metadata changes and the problem goes away.

Rik van Riel suggested that applications should be required to open files with the O_SYNC option if they intend to access them in this mode, but Zwisler responded that the situation is not that simple. Jan Kara said that the problem could come from other applications performing operations in the filesystem that create metadata changes; those applications may be completely unaware of the other users and will not be concerned with flushing their changes out. Getting around that problem would require some sort of state stored at the inode level and not, like O_SYNC, at the file-descriptor level.

But even then, the filesystem itself can destabilize the metadata by, for example, performing deduplication. In the end, Kara said, the only way for an application to know that a filesystem is in a consistent state on-disk is to call fsync(). Moving control of flushing to user space breaks a lot of assumptions; there will need to be a way to prevent filesystems from messing with things.

Zwisler said that Chinner's proposal had anticipated this problem and, as a result, came with a lot of caveats. It would be necessary to turn off reflink functionality and other filesystem features, for example. Zwisler also said that device DAX, which presents persistent memory as a character device without a filesystem, exists for this kind of thing; device DAX gives the user total control. For the filesystem implementation, it might be best to just go with the preallocation idea, he said, while making it painful enough that there will be an incentive not to use it. But the incentives to use it will also be there: by avoiding system calls, the user-controlled method is always going to be faster.

Kara said that history shows that, if somebody is interested in a feature, businesses will work to provide it. With enough motivation, these problems can be solved. Zwisler said that there is a strong desire to have a filesystem in place on persistent memory; filesystems provide or enable nice features like naming, backups, and more. What is really needed is a new filesystem that was designed for persistent memory from the beginning, but that is not a short-term solution. Even if such a filesystem were to appear tomorrow, it's a rare user who is willing to trust production data to a brand-new filesystem. So we are going to have to get by with what we have now for some time yet.

The group concluded that, for now, users will have to get by with limiting metadata updates or using device DAX. With luck, adventurous users will experiment with other ideas out of tree and better solutions will eventually emerge.

The next question had to do with platforms that support "flush on fail" functionality — the ability to automatically flush data to persistent memory after a crash. On such hardware, there is no need to worry about doing explicit cache flushes; indeed, doing so will just slow things down. The big problem here is that there is no discovery method for this feature, so the user must ask for flushes to be turned off if they know that their hardware will do flush on fail. A feature to allow that will be provided; it is seen as being similar to the ability to turn off writeback caching on hard drives.

Currently DAX is still marked as an experimental feature in the kernel, and mounting a filesystem with DAX enabled results in a warning in the log. When, Zwisler asked, can this be turned off? Support for the reflink feature, or at least the ability to "not collide with it", seems to be one remaining requirement; it is evidently being worked on. Dan Williams noted that DAX is currently turned off if page structures are not available for the persistent-memory array. It is possible to operate without those structures, but there is no support for huge pages, fork() will fail if persistent memory is mapped, and it's not possible to use a debugger on programs that have that memory mapped. He asked whether this was worth fixing, noting that it would not be a small job. Interest in addressing the issue seemed relatively low in the room.

Zwisler said that the filesystem mount options for DAX are currently inconsistent. With ext4, DAX either works for all files or it doesn't work at all; XFS, instead, can enable or disable DAX on a per-inode basis. It would be better, he said, to have consistent behavior across filesystems before proclaiming the feature to be stable.

Another wishlist feature is support for 1GB extra-huge pages. Device DAX can use such pages now, but they are not available when there is a filesystem involved. Fixing that problem would be relatively complex; among other things, it would require filesystems to lay out files in 1GB-aligned extents, which none do now. It is not clear that there is a use case for this feature, so nobody seems motivated to make it work now.

The session concluded with a review of the changes needed to remove the "experimental" tag from DAX. More testing was added to the list; it is not yet clear whether the test coverage is as good as it needs to be. The concerns about interaction with reflink need to be addressed, and making the mount options consistent is also on the list (though some developers would like to see the mount options go away entirely). That list is long enough that the future of DAX seems to include "experimental" status for a little while yet.

Comments (4 posted)

Huge pages in the ext4 filesystem

By Jonathan Corbet
March 28, 2017

LSFMM 2017
When the transparent huge page feature was added to the kernel, it only supported anonymous (non-file-backed) memory. In 2016, support for huge pages in the page cache was added, but only for the tmpfs filesystem. There is interest in expanding support to other filesystems since, for some workloads, the performance improvement can be significant. Kirill Shutemov led the only session of the 2017 Linux Storage, Filesystem, and Memory-Management Summit to bring together just the filesystem and memory-management tracks; the topic was adding huge-page support to the ext4 filesystem.

He started by saying that the tmpfs support works well now, so it's time to take the next step and support a real filesystem. Compound pages are used to represent huge pages in the system memory map; the first of the range of (small) pages that makes up a huge page is the head page, while the rest are tail pages. Most of the important metadata is stored in the head page. Using compound pages allows the entire huge page to be represented by a single entry in the least-recently-used (LRU) lists, and all buffer-head structures, if any, are tied to the head page. Unlike DAX, he said, transparent huge pages do not force any constraints on a file's on-disk layout.
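
A sketch of what that indirection looks like in kernel code (simplified; real call sites vary):

    /* Sketch (kernel context): state queries on any constituent page of
     * a compound page are answered from the head page. */
    #include <linux/mm.h>

    static bool huge_page_uptodate(struct page *page)
    {
        struct page *head = compound_head(page); /* head for any tail page */

        return PageUptodate(head);  /* one up-to-date flag covers the whole huge page */
    }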

With tmpfs, he said, the creation of a huge page causes the addition of 512 (single-page) entries to the radix tree; this cannot work in ext4. It is also necessary to add DAX support and to make it work consistently. There are a few other problems; for example, readahead doesn't currently work with huge pages. The maximum size of the readahead window is 128KB, far less than the size of a huge page. He was not sure if that was a big deal but, if it is, it will need to be fixed. Huge pages also cause any shadow entries in the page cache to be ignored, which could worsen the system's page-reclaim decisions.

He emphasized that huge pages need to avoid breaking existing semantics. That means that it will be necessary to fall back to small pages at times. Page migration was one example of when that can happen. A related problem is that a lot of system calls provide 4KB resolution, and that can interfere with huge-page use. Use of encryption in ext4 will also force a fallback to small pages.

Given all that, he asked, is there any reason not to pursue the addition of huge-page support to ext4? He has patches that have been circulating for a while; his current plan is to rebase them onto the current page cache work and repost them.

Jan Kara asked if there was a need to push knowledge of huge pages into every filesystem, adding complexity, or if it might be possible for filesystems to always work with small pages. Shutemov responded that this is not always an option. There is, for example, a single up-to-date flag for the entire compound page. It makes sense to work to make the abstractions cleaner and hide the differences whenever possible, and he has been doing that, but the solution is not always obvious.

Kara continued, saying that there needs to be some sort of proper data structure for tracking sub-page state. The kernel currently uses a list of buffer-head structures, but that could perhaps be changed. There might be an advantage to finer-grained tracking. But he repeated that he doesn't see a reason why filesystems should need to know about the size of pages as stored in the page cache, and that teaching every filesystem about a variably sized page cache will be a significant effort. Shutemov agreed with the concern, but said that the right approach is to create an implementation for a single filesystem, get it working, then try to create abstractions from there.

Matthew Wilcox, instead, complained that the current work only supports two page sizes, while he would like it to handle any compound page size. Generalizing the code to make that possible, he said, would make the whole thing cleaner. The code doesn't have to actually handle every size from the outset, but it should be prepared for that.

Trond Myklebust said that he would like to have proper support for huge pages in the page cache. In the NFS code, he has to do a lot of looping and gathering to get up to reasonable block sizes. Ted Ts'o asked whether the time had come to split the notion of a page's size (PAGE_SIZE) and the size of data stored in the page cache (PAGE_CACHE_SIZE). The kernel used to treat the two differently, but that distinction was removed some time ago, resulting in cleaner code. Wilcox responded that the meaning of PAGE_CACHE_SIZE was never well defined in the past, and that generalizing the handling of page-cache size is not a cleanup, it's a performance win. He suggested it might also make it easier to support multiple block sizes in ext4, though Shutemov was quick to add that he couldn't promise that.

The problem with larger block sizes, Ts'o said, comes about when a process takes a fault on a 4KB page, and the filesystem needs to bring in a larger block. This has never been easy. The filesystem people say it's a memory-management problem, while the memory-management people point their finger at filesystems. This situation has stayed this way for a long time, he said. Wilcox said he wants it to be a memory-management problem; his work to support variable-sized pages in the page cache should address much of it.

Andrea Arcangeli said that the real problem happens when larger pages are not available for allocation. The transparent huge pages code is careful to never require such allocations; it will always fall back to smaller pages. He would not like to see that change. Instead, he said, the real solution is to increase the base page size. Rik van Riel answered that, if the page cache contains more large pages, they will be available for reclaim and should be easier to allocate than they are now.

As the session closed, Ts'o observed that the required changes are much larger on the memory-management side than on the ext4 side. If the group is happy with this work, perhaps it's time to merge it with the idea that the remaining issues can be fixed up later. Or, perhaps, it's better to try to further evolve the interfaces first. It is, he said, more of a memory-management decision, so he will defer to that group. Shutemov said that the page-cache interface is the hardest part; he will look at making the interface with filesystems cleaner. But, he warned, it doesn't make sense to try to abstract everything from the outset.

Comments (1 posted)

Supporting shared TLB contexts

By Jonathan Corbet
March 28, 2017

LSFMM 2017
A processor's translation lookaside buffer (TLB) caches the mappings from virtual to physical addresses. Looking up virtual addresses is expensive, so good performance often depends on making the best use of the TLB. In the memory-management track of the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Mike Kravetz described a SPARC processor feature that can improve TLB performance and explored ways in which that feature could be supported.

On most processors, context switches between processes are expensive operations because they force the contents of the TLB to be flushed. SPARC differs, though, in that TLB entries carry a tag associating them with a specific context. Since the processor knows to ignore TLB entries that do not correspond to the process that is executing, there is no need to flush the TLB on context switches. That takes away much of the context-switch penalty and, as a result, improves performance.

The SPARC context register has been supported in Linux for a long time. But, Kravetz said, recent SPARC processors have added a second register, meaning that any given process can be associated with two independent contexts at the same time. Kravetz, an Oracle employee, said that this helps these processors support "the most important application in the world" — the Oracle database — which is built around a set of processes working on a large shared-memory area. If the second context ID is assigned to that area, then the associated TLB entries can be shared across all of those processes.

He has posted a patch set allowing this register to be used for shared-memory areas. The patch is "80% SPARC code", though, so nobody but Dave Miller (the SPARC maintainer) has looked at it, he said. His hope was to draw more attention to this feature and work out the best way to expose the functionality of this second context ID to user space.

His thinking is to have a special virtual memory area (VMA) flag to indicate a memory region with a shared context. But that leaves the question of how that flag should be set; Kirill Shutemov observed that it could be difficult to provide a sane interface for this feature. Kravetz's proposal added a special flag to the mmap() and shmat() system calls. One nice feature of this approach is that it does not require exposing the shared-context ID to user space. Instead, the kernel sees that the flag was set, assigns a context ID, and ensures that all processes mapping the same area at the same virtual address use the same context.
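
In rough terms, the proposal would be used as in the sketch below; MAP_SHARED_CTX and its value are illustrative stand-ins, not the flag from the actual patches:

    /* Hypothetical sketch of the proposed interface; MAP_SHARED_CTX and
     * its value are illustrative, not taken from Kravetz's patch set. */
    #include <sys/mman.h>

    #define MAP_SHARED_CTX  0x100000        /* hypothetical flag value */

    void *map_with_shared_context(int fd, size_t len, void *fixed_addr)
    {
        /* All participating processes must map the region at the same
         * virtual address for the shared TLB context to be usable. */
        return mmap(fixed_addr, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED | MAP_SHARED_CTX, fd, 0);
    }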

Matthew Wilcox suggested that perhaps madvise() would be a better system call for this functionality. The problem with madvise(), Kravetz said, is that it creates an inherent race condition. The shared context ID is stored in the page-table entries, so it needs to be set up before any of those entries are created. In particular, it needs to be in place before the process faults any of the pages in the shared region. Otherwise, those prematurely faulted pages will not be associated with the shared ID.

Kravetz's first patch set only supported pages mapped from hugetlbfs, which was enough to cover the Oracle shared-memory area. But he noted that it would be nice to cover executable mappings as well. While that would enable the shared ID to be used with shared libraries, the more immediate use case was, of course, the Oracle database executable. Dave Hansen reacted to this idea by observing that Oracle seems to be trying to glue its multiprocess implementation back into a single process. (This feature, it should be noted, would not play well with address-space layout randomization, since all mappings must be at the same virtual address.)

It was suggested that, in essence, hugetlbfs is a second memory-management subsystem for the kernel, providing semantics that the original lacked. DAX, perhaps, is developing into a third. The shared-context flag is needed because hugetlbfs is a second subsystem; otherwise, things would be shared more transparently. So perhaps the real answer is to get rid of hugetlbfs? The problem with that idea, Andrea Arcangeli said, is that hugetlbfs will always have a performance advantage over transparent huge pages because the huge pages are reserved ahead of time. There are not many hugetlbfs users out there, but those few really want it.

Arcangeli went on to say that the real problem with TLB performance is that Linux is still using small (4KB) pages; someday that page size is going to have to increase. Shutemov said that increase would be an ABI break, but Arcangeli countered that, when the x86-64 port was done, care was taken to not expose any boundaries smaller than 2MB to user space. That takes care of most potential ABI issues (on that architecture), but there are still cases where user space sees the smaller page size — mprotect() calls, for example. So Linux will not be able to get completely away from small pages anytime soon.

As the end of the session approached, Rik van Riel pulled the conversation back to the main topic by asking if there were any action items. It seems that there are no known bugs in Kravetz's patch set, other than the fact that it is limited to hugetlbfs, which ignores memory-allocation policies, cpusets, and more. Mel Gorman said that, since hugetlbfs is its own memory-management subsystem, it can do what it wants in that area; Michal Hocko suggested simply documenting the things that don't work properly. The final question came from Hansen, who asked whether this feature was really important or not. The answer seems to be "yes, because Oracle wants it".

Comments (8 posted)

The next steps for userfaultfd()

By Jonathan Corbet
March 29, 2017

LSFMM 2017
The userfaultfd() system call allows user space to intervene in the handling of page faults. As Andrea Arcangeli and Mike Rapoport described in a 2017 Linux Storage, Filesystem, and Memory-Management Summit session dedicated to the subject, userfaultfd() was originally created to help with the live migration of virtual machines between physical hosts. It allows pages to be copied to the new host on demand, after the machine itself has been moved, leading to faster, more predictable migrations. Work on userfaultfd() is not finished, though; there are a number of other features that developers would like to add.

In the 4.11 kernel, Arcangeli said, userfaultfd() can handle faults for missing pages, including anonymous, hugetlbfs, and shared-memory pages. There is also handling for a number of "non-cooperative events" (where the fault handler is unknown to the process whose faults are being managed), including mapping, unmapping, fork(), and more. At this point, though, only faults for not-present pages are managed; there would be value in dealing with other types of faults as well.
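
As a reminder of how the mechanism is driven from user space, here is a minimal sketch of missing-page handling (error handling omitted, 4KB pages assumed):

    /* Minimal sketch of userfaultfd() missing-page handling; error
     * handling is omitted and 4KB pages are assumed. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define PAGE_SZ 4096UL

    static void serve_missing_pages(void *area, unsigned long len)
    {
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)area, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        static char page[PAGE_SZ];  /* data to supply, e.g. from the migration source */
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        for (;;) {
            struct uffd_msg msg;

            read(uffd, &msg, sizeof(msg));      /* blocks until a fault arrives */
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                continue;

            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~(PAGE_SZ - 1),
                .src = (unsigned long)page,
                .len = PAGE_SZ,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);    /* fill the page, wake the faulter */
        }
    }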

In particular, Arcangeli is looking at write-protect faults, where the page is present in memory but is not accessible for writing. There are a number of use cases for this feature, many based on the idea that it allows the efficient removal of a range of memory from a region. That can be done with munmap() as well, but that results in split virtual memory area (VMA) structures and thus hurts performance.

One potential use is efficient live snapshotting of running processes. The process could create a thread that would write the relevant memory to the snapshot. Memory that has been so written would then be write protected, generating faults when the main process tries to write there. Those faults can be used to copy the modified pages (and only those) to the snapshot. This feature could also be used to throttle copy-on-write faults, which are blamed for latency issues in some applications (Redis, for example).

Another possible use case is getting rid of the write bits in language runtime engines. Getting faults on writes would allow the runtime to efficiently track which pages have been written to. Beyond that, it could help improve the robustness of shared-memory applications by catching writes to file holes. It could be used to notice when a malicious guest is trying to circumvent the balloon driver and use more memory than it has been allocated, implement distributed shared memory, or implement the long-desired volatile ranges functionality.

At the moment, he has handling of write-protect faults working, but it reports all faults, not just those in the regions requested by the monitoring process. That, of course, means the monitor gets a lot of spurious events that must be filtered out.

Rapoport talked briefly about the non-cooperative userfaultfd() mode, which was merged for the 4.11 kernel. It has been added mainly for the container case; it allows, for example, the efficient checkpointing of containers. Recent work has added events for fork(), mremap(), and munmap(), but there are still some holes, including the fallocate() PUNCH_HOLE command and madvise(MADV_FREE).

The handling of events is currently asynchronous, but, for this case, Rapoport said, there would be value in supporting synchronous events as well. There are also problems with pages shared between multiple processes resulting in the creation of multiple copies. Fixing that would require an operation to inject a single page into multiple address spaces at once.

Perhaps the trickiest remaining problem, though, is using userfaultfd() on processes that are, themselves, using userfaultfd(). Fixing that will require adding a mechanism that allows the chaining of events. The first process (the checkpoint/restart mechanism, for example) would get all events, including a notification when the monitored process starts using userfaultfd() too. After that, events could be handled directly or passed down to the next level. There are a number of unanswered questions around nested use of userfaultfd(), though, so a complete solution is probably some time away.

Comments (1 posted)

Memory-management patch review

By Jonathan Corbet
March 29, 2017

LSFMM 2017
Memory-management (MM) patches are notoriously difficult to get merged into the mainline kernel. They are subjected to a high degree of review because this is an area where it is easy to get things wrong. Or, at least, that is how it used to be. The final memory-management session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit was concerned with patch review in the MM subsystem — or the lack of it.

Michal Hocko started the session off by saying that too many patches get into Andrew Morton's -mm tree without proper review. Fully half of them, he said, lack an Acked-by or Reviewed-by tag. But that is only part of the problem: even when patches do carry tags indicating that review has been done, that review is often superficial at best, focusing on little details. Reviewers are not taking the time to think about the real problem, he said. As a result, MM developers are "building hacks on other hacks because nobody remembers why they were added in the first place".

As an example, he raised memory hotplug, and the care that is taken when shifting pages between memory zones. But much of that could be avoided by simply not assigning pages to zones as early as happens now. MM developers were used to working around this issue, he said, and so never really looked into it. In the end, this is turning the MM subsystem into an unmaintainable mess that is getting worse over time. How, he asked, can we get more review for MM patches, as befits a core kernel subsystem? How can we get review that really matters, and how can we force submitters to fix the problems that are found?

One option, Hocko said, is to make it mandatory that every MM patch have at least one review tag. That, he said, is likely to slow things down considerably. There are 100-150 MM patches merged in each development cycle; if the 50% or so of them without review tags are held back, a lot less will get merged. Is the community OK with that?

Kirill Shutemov said that, if reviews are required to get patches merged, there will also need to be a way to get developers to do those reviews. Hocko agreed, saying that few developers are reviewing patches now. Mel Gorman said that requiring reviews might be fair, but there should be one exception: when developers modify their own code. In general, the principal author should not need as much review for subsequent modifications.

Morton said that a lot of patches do not really require review; many of them are trivial in nature. When review does happen, he said, the quality can vary considerably; there are some Reviewed-by tags that he doesn't believe at all. Gorman agreed that reviews need to have some merit to be counted.

Johannes Weiner worried that requiring reviews could cause work to fall through the cracks. Obscure bug fixes might not get merged, and memory-hotplug work could languish. Memory hotplug is a particular "problem child", Morton said; there is a lot of drive-by work and he has no idea who can review it. Quite a few people, Hocko added, are pursuing their own use case and don't really care about the rest. Part of the problem, Morton said, is that nobody really wants to clean up memory hotplug and, even if they did, they don't have the hardware platforms that would allow them to test the result.

Gorman said that it is important to start enforcing some sort of rule around review. Patches that still need review should have a special tag in the -mm tree. If the percentage of patches so tagged is too high when the -rc5 prepatch comes out, developers who have pending patches should be conscripted to do some review work. That would, at least, encourage the active developers to do a bit more review work.

Hocko then went back to the issue of trivial patches which, he said, are a bigger problem than many people think. Many of them are broken in obscure ways and create problems. Gorman suggested simply dropping trivial patches that have no user impact. Morton said that he could make an effort to be more careful when taking those patches, but that his attempts to get reviews for these patches are often ignored. If the people who have touched a certain file ignore a patch to it, Gorman said, that patch should just be dropped.

Morton replied that he is reluctant to mandate a system where it's impossible to get changes into the kernel if you can't get them reviewed. People get busy or take vacations, and many of those patches are changes that we want anyway. Dropping them would be detrimental to the kernel as a whole. Hocko said that XFS is now mandating reviews for all changes, and it doesn't appear to be suffering from patches being dropped on the floor.

The discussion then shifted to high-level design review, with Hocko saying that high-level review is hard and he wishes we had more of it, but it is not the biggest problem. The real issue is that we have more submitters of changes than reviewers of those changes. Morton said that he would push harder to get patches reviewed, and would do a walk-through around -rc5 to try to encourage review for specific patches needing it.

Morton said there are particular problems around specific patch sets that never seem to get enough review. Heterogeneous memory management is one of those; it is massive, and somebody spent a lot of time on it, but there don't seem to be a whole lot of other people who care about it. The longstanding ZONE_CMA patches are another example. There is a demand for this work, but it has been blocked, he said, partly because Gorman doesn't like it. Gorman replied that he still thinks it's not a good idea, and "you're going to get a kicking from it", but if the people who want that feature want to maintain it, they should go for it; it doesn't affect others. So he will not block the merging of that code.

Hocko raised the topic of the hugetlbfs code, which is complex to the point that few developers want to touch it. Perhaps, he said, hugetlbfs should be put into maintenance mode with no new features allowed. The consensus on this idea seemed to be that the MM developers should say "no more" to changes in this area, but not try to impose strict rules.

Another conclusion came from Morton, who encouraged the MM developers to be more vocal on the mailing lists. The volume on the linux-mm list is reasonable, so there is no real excuse for not paying attention to what is happening there. Developers should, he said, "hit reply more often". Gorman agreed, but said that there need to be consequences from those replies; if a developer pushes back on a patch, that patch needs to be rethought.

By that time, the end of LSFMM was in sight, and thoughts of beer began to take over. Whether this discussion leads to better review of MM patches remains to be seen, but it has, if nothing else, increased awareness of the problem.

Comments (5 posted)

Stream ID status update

By Jake Edge
March 29, 2017

LSFMM 2017

Stream IDs as a way for the host to give storage devices hints about what kind of data is being written have been discussed before at LSFMM. This year, Andreas Dilger and Martin Petersen led a combined storage and filesystem session to update the status of the feature.

Dilger began by noting that the feature looked like it was moving forward and would make its way into the kernel, but it hasn't. There are multiple use cases for it, including making it easier for SSDs to decide where to store data to reduce the amount of copying needed when garbage collecting. It would also help developers using blktrace to do analysis at the block layer and could help bcachefs make better decisions about what to put in flash or on disk.

Embedding a stream ID in block I/O requests would help with those cases and more, he said. It would allow all kinds of storage to make better allocation and scheduling decisions. But development on it seems to have gone quiet, so he was hoping to get an update from Petersen (and the others in the room) on the status of stream IDs.

Petersen said that he ran some benchmarks using stream IDs and "all the results were great". But the storage vendors seem to have lost interest. They are off pursuing deterministic writes, he said. Deterministic writes are a way to avoid the performance hiccups caused by background tasks (like wear leveling and garbage collection) by writing in the "proper" way.

But Jens Axboe thought that stream IDs should still be worked on. He would like to see a small set of stream IDs (two, perhaps) that simply give an advisory hint of whether the data is likely to be short-lived or long-lived. That would mean there would be no need for a bunch of different flags to be agreed upon and defined. He prefers to simply separate data with different deletion characteristics.
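
For illustration, the sort of two-hint interface Axboe describes could look like the per-file write-life hints that later appeared in the kernel; the fallback values in this sketch mirror <linux/fcntl.h>:

    /* Sketch: tag a file's data as short-lived using per-file
     * write-life hints. The fallback definitions below mirror
     * <linux/fcntl.h> on kernels that support the feature. */
    #include <fcntl.h>
    #include <stdint.h>

    #ifndef F_SET_RW_HINT
    #define F_SET_RW_HINT        1036       /* F_LINUX_SPECIFIC_BASE + 12 */
    #define RWH_WRITE_LIFE_SHORT 2
    #endif

    int mark_short_lived(int fd)
    {
        uint64_t hint = RWH_WRITE_LIFE_SHORT;   /* likely to be rewritten soon */

        return fcntl(fd, F_SET_RW_HINT, &hint);
    }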

Dilger said that filesystems could provide more information that might help the storage devices make even better decisions on data placement. Some fairly simple information on writes of metadata versus user data would help. Axboe wondered if an API should be exposed so that applications could tell the kernel what kind of data they were writing, but Dilger thought that the kernel is able to provide a lot of useful information on its own.

Ted Ts'o asked if it would be helpful to add a 32-bit stream ID to struct bio that blktrace would display. Petersen said he had been using 16-bit IDs because that's what the devices use, but more bits would be useful for tracing purposes. Dilger said that he didn't want the kernel implementation to be constrained by the hardware; there will need to be some kind of mapping of the IDs in any case. The only semantic that would apply is that writes with the same ID are related to each other in some fashion.

The hint that really matters is short-lived versus not short-lived, Axboe believes. So it makes sense to just have a simple two-stream solution. That will result in 99% of the benefit, he said. But an attendee said that only helps for flash devices, not shingled magnetic recording (SMR) devices and others. In addition, Ts'o thought that indicating filesystem journal writes was helpful. Petersen agreed that it made a big difference for SMR devices.

Axboe said that he had a patch set from about a year ago that he will dust off and post to the list soon. The discussion of whether an API is needed and, if so, what it should look like can happen on the mailing list. Once the kernel starts setting stream IDs, though, there may be performance implications that will need to be worked out. In some devices, the stream IDs are closely associated with I/O channels on the device, so that may need to be taken into account.

Comments (none posted)

Network filesystem cache-management interfaces

By Jake Edge
March 29, 2017

LSFMM 2017

David Howells led a discussion on a cache-management interface for network filesystems at the first filesystem-only session of the 2017 Linux Storage, Filesystem, and Memory-Management Summit. For CIFS, AFS, NFS, Plan9, and others, there is a need for user space to be able to explicitly flush things out of the cache, pin things in the cache, and set cache parameters of various sorts. Howells would like to see a generic mechanism for doing so added to the kernel.

That generic mechanism could be ioctl() commands or something else, he said. It needs to work for targets that you may not be able to open and for mount points without triggering the automounter. There need to be some query operations to determine if a file is cached, how big the cache is, and what is dirty in the cache. Some of those will be used to support disconnected operation for network filesystems.

There are some cache parameters that would be set through the interface as well. Whether an object is cacheable or not, space reservation, cache limits, and which cache should be used are all attributes that may need to be set. It is unclear whether those settings should only apply to a single file or to volumes or subtrees, he said.

Disconnected operation requires the ability to pin subtrees into the cache and to tell the filesystem not to remove them. If there is a change to a file on the server while in disconnected-operation mode, there are some tools to merge the files. But changes to directory structure and such could lead to files that cannot be opened in the normal way. The filesystem would need to return ECONFLICT or something like that to indicate that kind of problem.

Howells suggested a new system call that looked like:

    fcachectl(int dirfd, const char *pathname, unsigned flags, 
              const char *cmd, char *result, size_t *result_len);

He elaborated somewhat in a post about the proposed interface to the linux-fsdevel mailing list.

There were some complaints about using the dirfd and pathname parameters; Jan Kara suggested passing a file handle instead. Howells is concerned that the kernel may not be able to do pathname resolution due to conflicts or may not be able to open the file at the end of the path due to conflicted directories. Al Viro said that those should be able to be opened using O_PATH.
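
Viro's suggestion can be shown with a small sketch; an O_PATH open yields a handle without opening the object for I/O, so even a conflicted file can still be named:

    /* Sketch: obtain a handle to a possibly-conflicted object with
     * O_PATH, which does not require that the file be openable for I/O. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    int get_cache_target(const char *pathname)
    {
        return openat(AT_FDCWD, pathname, O_PATH | O_CLOEXEC);
    }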

Trond Myklebust asked what would be using the interface; management tools "defined fairly broadly" was Howells's response. Most applications would not use the interface, but there are a bunch of AFS tools that do cache management using the path-based ioctl() (pioctl()) interface (which is not popular with Linux developers). Jeff Layton wondered if it was mostly for disconnected operation, but Howells said there are other uses for it that are "all cache-related"; he said that it was a matter of "how many birds I can kill with one stone".

The command-string interface (cmd) worried some as well. Josef Bacik thought that using the netlink interface made more sense than creating a new system call that would parse a command string. Howells did not want to have multiple system calls, so the command string is meant to avoid that. Bacik said that while netlink looks worrisome, it is actually really nice to use. Howells said he would look into netlink instead.

Comments (none posted)

Overlayfs features

By Jake Edge
March 29, 2017

LSFMM 2017

The overlayfs filesystem is being used more and more these days, especially in conjunction with containers. Amir Goldstein and Miklos Szeredi led a discussion about recent and upcoming features for the filesystem at LSFMM 2017.
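
For reference, the stacking arrangement under discussion combines a read-only lower layer with a writable upper layer and a work directory; a minimal sketch using mount(2), with illustrative paths:

    /* Sketch: assemble an overlay mount; the paths are illustrative.
     * Reads fall through to the lower layer; writes trigger copy-up
     * into the upper layer. */
    #include <sys/mount.h>

    int mount_overlay(void)
    {
        return mount("overlay", "/merged", "overlay", 0,
                     "lowerdir=/lower,upperdir=/upper,workdir=/work");
    }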

Goldstein said that he went back to the 4.7 kernel to look at what has been added since then for overlayfs. There has been a fair amount of work in adding support for unprivileged containers. 4.8 saw the addition of SELinux support, while 4.9 added POSIX access-control lists (ACLs) and fixed file locks. 4.10 added support for cloning a file instead of copying it up on filesystems that support cloning (e.g. XFS).

There is ongoing work on using overlayfs to provide snapshots of directory trees on XFS. It is not clear when that will be merged, but 4.11 should see the addition of a parallel copy-up operation, which will speed copy-up on filesystems that do not support cloning.

Another feature that is coming, perhaps in the 4.12 time frame, is to handle the case where an application gets inconsistent data because a copy up has occurred. Szeredi explained that if an application opens a file in the lower layer that gets copied up due to a write from some other program, the application will get only old data because it will still have that lower-layer file open. There are plans to change the read() and mmap() paths to check if a file has been copied up and change the kernel's view of the file to point at the new file.

But Al Viro was concerned that it would change a fundamental behavior that applications expect. If a world-readable file is opened, then has its permission changed to exclude the reader (which causes a copy up), the application would not expect errors at that point, but this solution would change that. Szeredi suggested that the open of the upper file could be done without permission checks, which Viro thought might work for some local filesystems, but not for upper layers on remote filesystems.

But Bruce Fields wondered if the behavior could even be changed the way Szeredi described. There could be applications that rely on the current behavior, or else no one is really using overlayfs. Viro said that he didn't believe any applications use the behavior. But, he noted, he has broken things in the past where the breakage did not surface, and bugs did not get filed, until years later, when users actually started testing their applications on the changed kernels.

Szeredi pointed out that these changes will make overlayfs more POSIX compliant and that there are other changes to that end that are coming. Fields is still concerned that the semantics are going to change in subtle ways over the next few years while people are actually using the filesystem. If people use it enough, there will be bugs filed from changing the behavior. But Jeff Layton said that even if it were noticed in some applications, it would be hard to argue against bringing overlayfs into POSIX compliance.

Goldstein said that there have also been a lot of improvements in the overlayfs test suite. There is support for running those tests from xfstests, so he asked the assembled filesystem developers to run them on top of their filesystems. He also mentioned overlayfs snapshots, which kind of turns overlayfs on its head, making the upper layer into a snapshot, while the lower layer is allowed to change. Any modifications to the lower-layer objects cause a copy-up operation to preserve the contents prior to the change, while any file-creation operation causes a whiteout in the snapshot. So when the lower layer is viewed through the snapshot, it appears just as the filesystem did at snapshot time.

Comments (5 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.11-rc4 Mar 26
Greg KH Linux 4.10.6 Mar 27
Greg KH Linux 4.9.18 Mar 27
Sebastian Andrzej Siewior v4.9.18-rt14 Mar 28
Greg KH Linux 4.4.57 Mar 27

Architecture-specific

AKASHI Takahiro arm64: add kdump support Mar 28
Pavel Tatashin Early boot time stamps for x86 Mar 24

Core kernel code

Development tools

Device drivers

Bjorn Andersson leds: Add driver for Qualcomm LPG Mar 22
Matt Redfearn MIPS: Remote processor driver Mar 23
Jaghathiswari Rankappagounder Natarajan Support for ASPEED AST2400/AST2500 PWM and Fan Tach driver Mar 24
Elaine Zhang rk808: Add RK805 support Mar 27
michael.hennerich@analog.com iio:adc: Driver for Linear Technology LTC2497 ADC Mar 27
Jacopo Mondi iio: adc: Maxim max9611 driver Mar 23
Arnaud Pouliquen Add STM32 DFSDM support Mar 17
Alex Deucher Add Vega10 Support Mar 20
Ralph Sennhauser gpio: mvebu: Add PWM fan support Mar 27
Steve Longerbeam i.MX Media Driver Mar 27
Sebastian Reichel i2c: add sc18is600 driver Mar 29
Andrey Smirnov GPCv2 power gating driver Mar 28
Sebastian Reichel Nokia H4+ support Mar 28
sean.wang@mediatek.com net-next: dsa: add Mediatek MT7530 support Mar 29
Marc Gonzalez Tango PCIe controller support Mar 29
Christopher Bostic FSI device driver implementation Mar 29

Device driver infrastructure

Documentation

Filesystems and block I/O

Goldwyn Rodrigues No wait AIO Mar 15
Omar Sandoval blk-mq: multiqueue I/O scheduler Mar 17
Shaohua Li blk-throttle: add .low limit Mar 27

Memory management

Security-related

Miscellaneous

John W. Linville ethtool 4.10 released Mar 24

Page editor: Jonathan Corbet


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
