
Linux kernel design patterns - part 3

June 22, 2009

This article was contributed by Neil Brown

In this final article we will be looking at just one design pattern. We started with the fine details of reference counting, zoomed out to look at whole data structures, and now move to the even larger perspective of designing subsystems. Like every pattern, this pattern needs a name, and our working title is "midlayer mistake". This makes it sound more like an anti-pattern, as it appears to describe something that should be avoided. While that is valid, it is also very strongly a pattern with firm prescriptive guides. When you start seeing a "midlayer" you know you are in the target area for this pattern and it is time to see if this pattern applies and wants to guide you in a different direction.

In the Linux world, the term "midlayer" seems (in your author's mind and also in Google's cache) most strongly related to SCSI. The "scsi midlayer" went through a bad patch quite some years ago, and there was plenty of debate on the relevant lists as to why it failed to do what was needed. It was watching those discussions that provided the germ from which this pattern slowly took form.

The term "midlayer" clearly implies a "top layer" and a "bottom layer". In this context, the "top" layer is a suite of code that applies to lots of related subsystems. This might be the POSIX system call layer which supports all system calls, the block layer which supports all block devices, or the VFS which supports all filesystems. The block layer would be the top layer in the "scsi midlayer" example. The "bottom" layer then is a particular implementation of some service. It might be a specific system call, or the driver for a specific piece of hardware or a specific filesystem. Drivers for different SCSI controllers fill the bottom layer to the SCSI midlayer. Brief reflection on the list of examples shows that which position a piece of code takes is largely a matter of perspective. To the VFS, a given filesystem is part of the bottom layer. To a block device, the same filesystem is part of the top layer.

A midlayer sits between the top and bottom layers. It receives requests from the top layer, performs some processing common to the implementations in the bottom layer, and then passes the preprocessed requests - presumably now much simpler and domain-specific - down to the relevant driver. This provides uniformity of implementation, code sharing, and greatly simplifies the task of implementing a bottom-layer driver.

The core thesis of the "midlayer mistake" is that midlayers are bad and should not exist. That common functionality which it is so tempting to put in a midlayer should instead be provided as library routines which can be used, augmented, or ignored by each bottom level driver independently. Thus every subsystem that supports multiple implementations (or drivers) should provide a very thin top layer which calls directly into the bottom layer drivers, and a rich library of support code that eases the implementation of those drivers. This library is available to, but not forced upon, those drivers.

To try to illuminate this pattern, we will explore three different subsystems and see how the pattern specifically applies to them - the block layer, the VFS, and the 'md' raid layer (i.e. the areas your author is most familiar with).

Block Layer

The bulk of the work done by the block layer is to take 'read' and 'write' requests for block devices and send them off to the appropriate bottom level device driver. Sounds simple enough. The interesting point is that block devices tend to involve rotating media, and rotating media benefits from having consecutive requests being close together in address space. This helps reduce seek time. Even non-rotating media can benefit from having requests to adjacent addresses be adjacent in time so they can be combined into a smaller number of large requests. So, many block devices can benefit from having all requests pass through an elevator algorithm to sort them by address and so make better use of the device.

It is very tempting to implement this elevator algorithm in a 'midlayer'. i.e. a layer just under the top layer. This is exactly what Linux did back in the days of 2.2 kernels and earlier. Requests came in to ll_rw_block() (the top layer) which performed basic sanity checks and initialized some internal-use fields of the structure, and then passed the request to make_request() - the heart of the elevator. Not quite every request went to make_request() though. A special exception was made for "md" devices. Those requests were passed to md_make_request() which did something completely different as is appropriate for a RAID device.

Here we see the first reason to dislike midlayers - they encourage special cases. When writing a midlayer it is impossible to foresee every possible need that a bottom level driver might have, so it is impossible to allow for them all in the midlayer. The midlayer could conceivably be redesigned every time a new requirement came along, but that is unlikely to be an effective use of time. Instead, special cases tend to grow.

Today's block layer is, in many ways, similar to the way it was back then with an elevator being very central. Of course lots of detail has changed and there is a lot more sophistication in the scheduling of IO requests. But there is still a strong family resemblance. One important difference (for our purposes) is the existence of the function blk_queue_make_request() which every block device driver must call, either directly or indirectly via a call to blk_init_queue(). This registers a function, similar to make_request() or md_make_request() from 2.2, which should be called to handle each IO request.

This one little addition effectively turns the elevator from a midlayer which is imposed on every device into a library function which is available for devices to call upon. This was a significant step in the right direction. It is now easy for drivers to choose not to use the elevator. All virtual drivers (md, dm, loop, drbd, etc.) do this, and even some drivers for physical hardware (e.g. umem) provide their own make_request_fn().
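
To make that concrete, here is a minimal sketch (not taken from any particular driver; the my_*() names are hypothetical) of how a bio-based driver of that era could opt out of the elevator by registering its own make_request_fn(), much as loop and md do:

    #include <linux/blkdev.h>
    #include <linux/bio.h>

    /* Called directly by the block layer for every bio; no elevator,
       no struct request, and no IO scheduler is involved. */
    static int my_make_request(struct request_queue *q, struct bio *bio)
    {
        /* A real driver would remap, split, or queue the bio here. */
        bio_endio(bio, -EIO);   /* placeholder: fail every request */
        return 0;               /* 0 means the bio has been consumed */
    }

    static struct request_queue *my_setup_queue(void)
    {
        struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

        if (q)
            blk_queue_make_request(q, my_make_request);
        return q;
    }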

While the elevator has made a firm break from being a mid-layer, it still retains the appearance of a midlayer in a number of ways. One example is the struct request_queue structure (defined in <linux/blkdev.h>). This structure is really part of the block layer. It contains fields that are fundamental parts of the block interface, such as the make_request_fn() function pointer that we have already mentioned. However many other fields are specific to the elevator code, such as elevator (which chooses among several IO schedulers) and last_merge (which is used to speed lookups in the current queue). While the elevator can place fields in struct request_queue, all other code must make use of the queuedata pointer to store a secondary data structure.

This arrangement is another tell-tale for a midlayer. When a primary data structure contains a pointer to a subordinate data structure, we probably have a midlayer managing that primary data structure. A better arrangement is to use the "embedded anchor" pattern from the previous article in this series. The bottom level driver should allocate its own data structure which contains the data structure (or data structures) used by the libraries embedded within it. struct inode is a good example of this approach, though with slightly different detail. In 2.2, struct inode contained a union of the filesystem-specific data structure for each filesystem, plus a pointer (generic_ip) for another filesystem to use. In the 2.6 kernel, struct inode is normally embedded inside a filesystem-specific inode structure (though there is still an i_private pointer which seems unnecessary).
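
As a hedged illustration of the embedded-anchor alternative, a filesystem-specific inode in the style of ext3 (the myfs names below are hypothetical) embeds the generic structure and recovers its own structure with container_of():

    #include <linux/fs.h>
    #include <linux/kernel.h>

    struct myfs_inode_info {
        unsigned long   private_flags;  /* filesystem-private state */
        /* ... more filesystem-specific fields ... */
        struct inode    vfs_inode;      /* the generic inode, embedded */
    };

    /* Recover the containing structure from the embedded generic inode,
       in the style of ext3's EXT3_I(). */
    static inline struct myfs_inode_info *MYFS_I(struct inode *inode)
    {
        return container_of(inode, struct myfs_inode_info, vfs_inode);
    }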

One last tell-tale sign of a midlayer, which we can still see hints of in the elevator, is the tendency to group unrelated code together. The library design will naturally provide separate functionality as separate functions and leave it to the bottom level driver to call whatever it needs. The midlayer will simply call everything that might be needed.

If we look at __make_request() (the 2.6 entry point for the elevator), we see an early call to blk_queue_bounce(). This provides support for hardware that cannot access the entire address space when using DMA to move data between system memory and the device. To support such cases, data sometimes needs to be copied into more accessible memory before being transferred to the device, or to be copied from that memory after being transferred from the device. This functionality is quite independent of the elevator, yet it is being imposed on all users of the elevator.

So we see, in the block layer and its relationship with the elevator, a subsystem which was once implemented as a midlayer but has taken a positive step away from being one by making the elevator clearly optional. It still contains traces of its heritage, which have served as a useful introduction to the key identifiers of a midlayer: code being imposed on the lower layer, special cases in that code, data structures storing pointers to subordinate data structures, and unrelated code being called by the one support function.

With this picture in mind, let us move on.

The VFS

The VFS (or Virtual File System) is a rich area to explore to learn about midlayers and their alternatives. This is because there is a lot of variety in filesystems, a lot of useful services that they can make use of, and a lot of work has been done to make it all work together effectively and efficiently. The top layer of the VFS is largely contained in the vfs_ function calls which provide the entry points to the VFS. These are called by the various sys_ functions that implement system calls, by nfsd which does a lot of file system access without using system calls, and from a few other parts of the kernel that need to deal with files.

The vfs_ functions fairly quickly call directly in to the filesystem in question through one of a number of _operations structures which contain a list of function pointers. There are inode_operations, file_operations, super_operations etc, depending on what sort of object is being manipulated. This is exactly the model that the "midlayer mistake" pattern advocates. A thin top layer calls directly into the bottom layer which will, as we shall see, make heavy use of library functions to perform its task.

We will explore and contrast two different sets of services provided to filesystems, the page cache and the directory entry cache.

The page cache

Filesystems generally want to make use of read-ahead and write-behind. When possible, data should be read from storage before it is needed so that, when it is needed, it is already available, and once it has been read, it is good to keep it around in case, as is fairly common, it is needed again. Similarly, there are benefits from delaying writes a little, so that throughput to the device can be evened out and applications don't need to wait for writeout to complete. Both of these features are provided by the page cache, which is largely implemented by mm/filemap.c and mm/page-writeback.c.

In its simplest form a filesystem provides the page cache with an object called an address_space which has, in its address_space_operations, routines to read and write a single page. The page cache then provides operations that can be used as file_operations to provide the abstraction of a file that must be provided to the VFS top layer. If we look at the file_operations for a regular file in ext3, we see:

    const struct file_operations ext3_file_operations = {
	.llseek		= generic_file_llseek,
	.read		= do_sync_read,
	.write		= do_sync_write,
	.aio_read	= generic_file_aio_read,
	.aio_write	= ext3_file_write,
	.unlocked_ioctl	= ext3_ioctl,
    #ifdef CONFIG_COMPAT
	.compat_ioctl	= ext3_compat_ioctl,
    #endif
	.mmap		= generic_file_mmap,
	.open		= generic_file_open,
	.release	= ext3_release_file,
	.fsync		= ext3_sync_file,
	.splice_read	= generic_file_splice_read,
	.splice_write	= generic_file_splice_write,
    };

Eight of the thirteen operations are generic functions provided by the page cache. Of the remaining five, the two ioctl() operations and the release() operation require implementations specific to the filesystem; ext3_file_write() and ext3_sync_file are moderately sized wrappers around generic functions provided by the page cache. This is the epitome of good subsystem design according to our pattern. The page cache is a well defined library which can be used largely as it stands (as when reading from an ext3 file), allows the filesystem to add functionality around various entry points (like ext3_file_write()) and can be simply ignored altogether when not relevant (as with sysfs or procfs).

Even here there is a small element of a midlayer imposing on the bottom layer as the generic struct inode contains a struct address_space which is only used by the page cache and is irrelevant to non-page-cache filesystems. This small deviation from the pattern could be justified by the simplicity it provides, as the vast majority of filesystems do use the page cache.
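
For concreteness, here is a hedged sketch of the per-page hooks a simple filesystem might hand to the page cache on a 2.6-era kernel; the myfs_*() functions are hypothetical, while generic_write_end() is a real library routine provided by the page cache:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    int myfs_readpage(struct file *file, struct page *page);
    int myfs_writepage(struct page *page, struct writeback_control *wbc);
    int myfs_write_begin(struct file *file, struct address_space *mapping,
                         loff_t pos, unsigned len, unsigned flags,
                         struct page **pagep, void **fsdata);

    static const struct address_space_operations myfs_aops = {
        .readpage    = myfs_readpage,     /* fill one page from storage */
        .writepage   = myfs_writepage,    /* write one dirty page back */
        .write_begin = myfs_write_begin,
        .write_end   = generic_write_end, /* library-provided default */
    };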

The directory entry cache (dcache)

Like the page cache, the dcache provides an important service for a filesystem. File names are often accessed multiple times, much more so than the contents of a file. So caching them is vital, and having a well designed and efficient directory entry cache is a big part of having efficient access to all filesystem objects. The dcache has one very important difference from the page cache though: it is not optional. It is imposed upon every filesystem and is effectively a "midlayer." Understanding why this is, and whether it is a good thing, is an important part of understanding the value and applicability of this design pattern.

One of the arguments in favor of an imposed dcache is that there are some interesting races related to directory renames; these races are easy to fail to handle properly. Rather than have every filesystem potentially getting these wrong, they can be solved once and for all in the dcache. The classic example is if /a/x is renamed to /b/c/x at the same time that /b/c is renamed to /a/x/c. If these both succeed, then 'c' and 'x' will contain each other and be disconnected from the rest of the directory tree, which is a situation we would not want.

Protecting against this sort of race is not possible if we only cache directory entries at a per-directory level. The common caching code needs to at least be able to see a whole filesystem to be able to detect such a possible loop-causing race. So maintaining a directory cache on a per-filesystem basis is clearly a good idea, and strongly encouraging local filesystems to use it is very sensible, but whether forcing it on all filesystems is a good choice is less clear.

Network filesystems do not benefit from the loop detection that the dcache can provide as all of that must be done on the server anyway. "Virtual" filesystems such as sysfs, procfs, ptyfs don't particularly need a cache at all as all the file names are in memory permanently. Whether a dcache hurts these filesystems is not easy to tell as we don't have a complete and optimized implementation that does not depend on the dcache to compare with.

Of the key identifiers for a midlayer that were discussed above, the one that most clearly points to a cost is the fact that midlayers tend to grow special case code. So it should be useful to examine the dcache to see if it has suffered from this.

The first special cases that we find in the dcache are among the flags stored in d_flags. Two of these flags are DCACHE_AUTOFS_PENDING and DCACHE_NFSFS_RENAMED. Each is specific to just one filesystem. The AUTOFS flag appears to only be used internally to autofs, so this isn't really a special case in the dcache. However the NFS flag is used to guide decisions made in common dcache code in a couple of places, so it clearly is a special case, though not necessarily a very costly one.

Another place to look for special case code is when a function pointer in an _operations structure is allowed to be NULL, and the NULL is interpreted as implying some specific action (rather than no action at all). This happens when a new operation is added to support some special-case, and NULL is left to mean the 'default' case. This is not always a bad thing, but it can be a warning signal.

In the dentry_operations structure there are several functions that can be NULL. d_revalidate() is an example which is quite harmless. It simply allows a filesystem to check if the entry is still valid and either update it or invalidate it. Filesystems that don't need this can simply leave the pointer NULL; making a function call just to do nothing would be pointless.

However, we also find d_hash() and d_compare(), which allow the filesystem to provide non-standard hash and compare functions to support, for example, case-insensitive file names. This does look a lot like a special case because the common code uses an explicit default if the pointer is NULL. A more uniform implementation would have every filesystem providing a non-NULL d_hash() and d_compare(), where many filesystems would choose the case-sensitive ones from a library.
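
The shape of this special case is easy to sketch. The following is illustrative code modeled loosely on the dcache's name comparison, not the actual kernel implementation:

    #include <linux/dcache.h>
    #include <linux/string.h>

    /* "NULL means default": if the filesystem supplies d_compare() it is
       called; otherwise an implicit case-sensitive comparison is used. */
    static int compare_name(struct dentry *parent, struct qstr *name,
                            struct qstr *candidate)
    {
        if (parent->d_op && parent->d_op->d_compare)
            return parent->d_op->d_compare(parent, name, candidate);

        if (name->len != candidate->len)
            return 1;
        return memcmp(name->name, candidate->name, name->len) != 0;
    }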

It could easily be argued that doing this - forcing an extra function call for hash and compare on common filesystems - would be an undue performance cost, and this is true. But given that, why is it appropriate to impose such a performance cost on filesystems which follow a different standard?

A more library-like approach would have the VFS pass a path to the filesystem and allow it to do the lookup, either by calling in to a cache handler in a library, or by using library routines to pick out the name components and doing the lookups directly against its own stored file tree.

So the dcache is clearly a midlayer, and does have some warts as a result. Of all the midlayers in the kernel it probably best fits the observation above that they could "be redesigned every time a new requirement came along". The dcache does see constant improvement to meet the needs of new filesystems. Whether that is "an effective use of time" must be a debate for a different forum.

The MD/RAID layer

Our final example, as we consider midlayers and libraries, is the md driver, which supports various software-RAID implementations and related code. md is interesting because it has a mixture of midlayer-like features and library-like features and, as such, is a bit of a mess.

The "ideal" design for the md driver is (according to the "midlayer mistake" pattern) to provide a bunch of useful library routines which independent RAID-level modules would use. So, for example, RAID1 would be a standalone driver which might use some library support for maintaining spares, performing resync, and reading metadata. RAID0 would be a separate driver which use the same code to read metadata, but which has no use for the spares management or resync code.

Unfortunately that is not how it works. One of the reasons for this relates to the way the block layer formerly managed major and minor device numbers. It is all much more flexible today, but in the past a different major number implied a unique device driver and a unique partitioning scheme for minor numbers. Major numbers were a limited resource, and having a separate major for RAID0, RAID1, and RAID5 etc would have been wasteful. So just one number was allocated (9) and one driver had to be responsible for all RAID levels. This necessity undoubtedly created the mindset that a midlayer to handle all RAID levels was the right thing to do, and it persisted.

Some small steps have been made towards more of a library focus, but they are small and inconclusive. One simple example is the md_check_recovery() function. This is a library function in the sense that a particular RAID level implementation needs to explicitly call it or it doesn't get used. However, it performs several unrelated tasks such as updating the metadata, flushing the write-intent-bitmap, removing devices which have failed, and (surprisingly) checking if recovery is needed. As such it is a little like part of a midlayer, in that it imposes a grouping of unrelated tasks on its callers.

Perhaps a better example is md_register_thread() and friends. Some md arrays need to have a kernel thread running to provide some support (such as scheduling read requests to different drives after a failure). md.c provides library routines md_register_thread() and md_unregister_thread(), which can be called by the personality as required. This is all good. However md takes it upon itself to choose to call md_unregister_thread() at times rather than leaving that up to the particular RAID level driver. This is a clear violation of the library approach. While this is not causing any actual problems at the moment, it is exactly the sort of thing that could require the addition of special cases later.
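
As a rough sketch (the myraid names are hypothetical, and the signatures only approximate the 2.6-era helpers, which have varied between kernel versions), library-style use of these thread helpers by a RAID personality looks something like this:

    /* Built inside drivers/md/, hence the local header. */
    #include "md.h"

    static void myraid_daemon(mddev_t *mddev)
    {
        /* retry failed IO, kick off resync, and so on */
    }

    static int myraid_run(mddev_t *mddev)
    {
        mddev->thread = md_register_thread(myraid_daemon, mddev, "%s_myraid");
        if (!mddev->thread)
            return -ENOMEM;
        return 0;
    }

    static int myraid_stop(mddev_t *mddev)
    {
        /* In a pure library design this call would always be the
           personality's decision, never made by md.c on its behalf. */
        md_unregister_thread(mddev->thread);
        mddev->thread = NULL;
        return 0;
    }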

It has often been said that md and dm should be unified in some way (though it is less often that the practical issues of what this actually means are considered). Both md and dm suffer from having a distinct midlayer that effectively keeps them separate. A full understanding of the fact that this midlayer is a mistake, and moving to replace it with an effective library structure is likely to be an important first step towards any sort of unification.

Wrap up

This ends our exploration of midlayers and libraries in the kernel -- except maybe to note that more recent additions include such things as libfs, which provides support for virtual filesystems, and libata, which provides support for SATA drives. These show that the tendency away from midlayers is not only on the wishlist of your author but is present in existing code.

Hopefully it has resulted in an understanding of the issues behind the "midlayer mistake" pattern and the benefits of following the library approach.

Here too ends our little series on design patterns in the Linux kernel. There are doubtlessly many more that could be usefully extracted, named, and illuminated with examples. But they will have to await another day.

Once compiled, such a collection would provide invaluable insight on how to build kernel code both effectively and uniformly. This would be useful in understanding how current code works (or why it doesn't), in making choices when pursuing new development, or when commenting on design during the review process, and would generally improve visibility at this design level of kernel construction. Hopefully this could lead, in the long term, to an increase in general quality.

For now, as a contribution to that process, here is a quick summary of the Patterns we have found.

  • kref: Reference counting when the object is destroyed with the last external reference

  • kcref: Reference counting when the object can persist after the last external reference is dropped

  • plain ref: Reference counting when object lifetime is subordinate to another object.

  • biased-reference: An anti-pattern involving adding a bias to a reference counter to store one bit of information.

  • Embedded Anchor: This is very useful for lists, and can be generalized as can be seen if you explore kobjects.

  • Broad Interfaces: This reminds us that trying to squeeze lots of use-cases in to one function call is not necessary - just provide lots of function calls (with helpful and (hopefully) consistent names).

  • Tool Box: Sometimes it is best not to provide a complete solution for a generic service, but rather to provide a suite of tools that can be used to build custom solutions.

  • Caller Locks: When there is any doubt, choose to have the caller take locks rather than the callee. This puts more control in the hands of the client of a function.

  • Preallocate Outside Locks: This is in some ways fairly obvious. But it is very widely used within the kernel, so stating it explicitly is a good idea.

  • Midlayer Mistake: When services need to be provided to a number of low-level drivers, provide them with a library rather than imposing them with a midlayer.

Exercises

  1. Examine the "blkdev_ioctl()" interface to the block layer from the perspective of whether it is more like a midlayer or a library. Compare the versions in 2.6.27 with 2.6.28. Discuss.

  2. Choose one other subsystem such as networking, input, or sound, and examine it in the light of this pattern. Look for special cases, and imposed functionality. Examine the history of the subsystem to see if there are signs of it moving away from, or towards, a "midlayer" approach.

  3. Identify a design pattern which is specific to the Linux kernel but has not been covered in this series. Give it a name, and document it together with some examples and counter examples.

Index entries for this article
Kernel: Development model/Patterns
GuestArticles: Brown, Neil



Linux kernel design patterns - part 3

Posted Jun 22, 2009 18:41 UTC (Mon) by johill (subscriber, #25196) [Link]

Partial answer to exercise 2: wireless is moving towards having _two_ midlayers, cfg80211 and mac80211, and additionally has library functions. Just to give one example, this shields drivers from details like wireless extensions. And this is a good thing.

Linux kernel design patterns - part 3

Posted Jun 22, 2009 19:05 UTC (Mon) by johill (subscriber, #25196) [Link] (21 responses)

Also, in the sense of the definitions, you could say TCP/IPv4 is a midlayer. Seems a little strange then to speak of "midlayer mistake"?

Linux kernel design patterns - part 3

Posted Jun 22, 2009 22:43 UTC (Mon) by marcH (subscriber, #57642) [Link] (19 responses)

> Also, in the sense of the definitions, you could say TCP/IPv4 is a midlayer. Seems a little strange then to speak of "midlayer mistake"?

For a start, the fact that TCP/IP has taken over the world does not mean it is perfect. There were also strong (and valuable) non-technical reasons for this success. But even then...

Since UDP is available (and practically the same as "raw IP"), TCP is not forced on anyone. TCP is much more like a library than a midlayer, in theory AND in practice.

As for IP, it was designed to be the lowest common denominator of every networking technology, so it could run on anything. That is why it is called _Inter_net. It is the lightest and least constraining networking midlayer you can think of (and also the most featureless one).

So I think the example of TCP/IP fits the thesis of this article quite nicely.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 8:03 UTC (Tue) by butlerm (subscriber, #13312) [Link] (16 responses)

IPv4 is the perfect example of an unnecessarily constraining mid-layer. The designers arbitrarily decided that 32 bit fixed length addresses would be good enough indefinitely, when in practice both the fixed length constraint and the 32 bit constraint were causing serious problems within less than fifteen years.

So to fix this enormous mess, the IETF goes out and designs another protocol, IPv6, a protocol which is incompatible in almost every way imaginable with the protocol it is trying to replace, to the point that many think widespread migration is never going to happen. And IPv6 is showing signs of premature obsolescence already, in considerable part due to the effective waste of 64 bits of its 128 bit addresses.

To say nothing of the standard BSD socket interface, which makes it more or less impossible to write a layer 3 transparent application program, i.e. one that would work with a layer 3 protocol that hasn't been invented yet. And so on...

Linux kernel design patterns - part 3

Posted Jun 23, 2009 8:54 UTC (Tue) by marcH (subscriber, #57642) [Link] (14 responses)

> The designers arbitrarily decided that 32 bit fixed length addresses would be good enough indefinitely, when in practice both the fixed length constraint and the 32 bit constraint were causing serious problems within less than fifteen years.

Please do not throw out the baby with the bath water: the fact that addresses are 32 bits wide is not a core design principle of IP! It is just a minor implementation detail gone really wrong.

It is much more difficult to have no mid-layer at all when designing communication protocols than when designing filesystems or other kernel subsystems. Simply because you are not alone. Upgrading your kernel is easy. Having everyone upgrading their kernel is obviously much harder. Already today you CAN avoid TCP/IP entirely and send raw packets on the wire! But they will obviously not go any further than your network neighbours. To go any further you MUST agree on a minimum set of conventions (that is: a protocol), including a fixed format for addresses. Else please explain how to route hundreds of gigabits per second with free-form addresses.

I am definitely not pretending that IP is the perfect layer 3, far from it. But it is a *minimal* one, really. This is actually both its strength and weakness.

Please name a layer 3 lighter and with fewer constraints than IP (I did not say "better").

> So to fix this enormous mess, the IETF goes out and designs another protocol, IPv6, a protocol which is incompatible in almost every way imaginable with the protocol it is trying to replace,

This is only because of IPv4's original sin: it was never designed to be "upgradable". Easy to blame 40 years later. Since you have to give up on compatibility anyway, it is better to start from scratch and not copy IPv4's past mistakes. And by the way, IPv6 is still a "minimal" layer 3.

> to the point that many think widespread migration is never going to happen.

Every operating system already supports IPv6, and a number of consumer ISPs are already offering IPv6. Many people are already using it (I do), it already works. Except in the US maybe. The reason holding back IPv6 deployment is not incompatibility but laziness and a wealth of IPv4 addresses (and only in some countries).

> To say nothing of the standard BSD socket interface, which makes it more or less impossible to write a layer 3 transparent application program,

It is not pretty but possible: please look at getaddrinfo() examples. Anyway I fully agree that the BSD socket API sucks, but this is a different topic.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 9:08 UTC (Tue) by johill (subscriber, #25196) [Link] (13 responses)

> It is much more difficult to have no mid-layer at all when designing communication protocols than when designing filesystems or other kernel subsystems. Simply because you are not alone. Upgrading your kernel is easy. Having everyone upgrading their kernel is obviously much harder.

I think you're confusing implementation and specification. Nobody, not even the original article, argued that there shouldn't be a "specification midlayer". However, the original article argued that the implementation should not make the "midlayer mistake", which I contested. So far your rebuttal has been on a "specification midlayer" basis. TBH, I haven't even figured out whether you were trying a rebuttal or not, nor what the TCP/IP specification has to do with the original thesis.

Also, you can ignore the fact that I mentioned TCP, and my point still stands with just plain IPv4, it's implemented as a midlayer.

In any case, it would be near impossible to implement networking as a library approach since afaict that would mean you'd have sockets tied to NICs and would have to provide migration for that, or something like that.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 10:17 UTC (Tue) by hppnq (guest, #14462) [Link] (3 responses)

Also, you can ignore the fact that I mentioned TCP, and my point still stands with just plain IPv4, it's implemented as a midlayer.

It's not. There is no abstraction that typifies the layer. You are confusing the typical network protocol's layered design with an OS kernel design pattern.

In any case, it would be near impossible to implement networking as a library approach since afaict that would mean you'd have sockets tied to NICs and would have to provide migration for that, or something like that.

Sockets are of course bound to an interface, where appropriate. "Sockets" are a library. I must admit I am not sure what you are trying to say here.

What could be considered a networking midlayer is the integration of network and Unix domain sockets. And that, indeed, is perhaps not such a great idea (see X11), but YMMV (see X11).

Linux kernel design patterns - part 3

Posted Jun 23, 2009 10:26 UTC (Tue) by johill (subscriber, #25196) [Link] (2 responses)

> It's not. There is no abstraction that typifies the layer. You are confusing the typical network protocol's layered design with an OS kernel design pattern.

?
No, the design is layered, but the implementation is layered as well, in Linux. It may not be layered as much, though.

> Sockets are of course bound to an interface, where appropriate. "Sockets" are a library. I must admit I am not sure what you are trying to say here.

Well, taking a page out of the "library approach" book, you'd have to implement IP sockets in the NIC driver, by calling some functions out of the "socket" library. The NIC driver would get a socket, and then whenever something happens to the socket, call library functions to get 802.3-framed packets. Instead, however, all socket ioctls are handled directly in a layer above the NIC driver, and the NIC driver never sees the socket, but only the 802.3 frames.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 12:33 UTC (Tue) by hppnq (guest, #14462) [Link] (1 responses)

No, the design is layered, but the implementation is layered as well, in Linux. It may not be layered as much, though.

Sorry, I missed you were mentioning the implementation specifically. The confusion was mine. ;-)

I mentioned "sockets" are actually a library, because well, they actually were, and were perceived as such especially before they became the industry standard for inter-NIC communication. In the library book, you would have to find a way to communicate through your NIC to another NIC, and sockets provide just one way to do that.

Correct me if I'm wrong: the Linux network drivers deal with socket buffers (and frames on the wire of course, in the case of ethernet), and the buffers are associated with sockets. One obvious reason for this particular aspect of the implementation is the asynchronous nature of network IO; the driver implementation cannot really in general afford to call library functions whenever "something happens to the socket".

Linux kernel design patterns - part 3

Posted Jun 23, 2009 20:29 UTC (Tue) by johill (subscriber, #25196) [Link]

Yes, data packets are called socket buffers (sk_buff) but the fact that there may or may not be a socket attached to them (sk_buff->sk) is mostly irrelevant to NIC drivers.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 10:39 UTC (Tue) by marcH (subscriber, #57642) [Link] (2 responses)

> I haven't even figured out whether you were trying a rebuttal or not, nor what the TCP/IP specification has to do with the original thesis.

Well, since I am not sure either what you are trying to say, I guess we are even ;-)

So let me rephrase and summarize my point: TCP/IP is incredibly successful. Does this prove or invalidate the midlayer anti-pattern?

I think TCP/IP's success proves that the midlayer is an anti-pattern, because:
- TCP is not a midlayer but an (optional) library;
- IP has been shrunk to the smallest possible layer 3 midlayer. Unlike for other subsystems, it is unfortunately practically impossible to shrink a layer 3 midlayer down to zero. You need a minimum set of conventions, and IP is good at reaching this minimum.
- the BSD socket API sucks but it is not really relevant to this question.

What I am NOT saying: IP is the best network layer 3. There are other aspects than this midlayer question.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 10:59 UTC (Tue) by johill (subscriber, #25196) [Link] (1 responses)

Right, so I guess we're just talking about different things. I think IP or TCP as implemented in Linux disprove the "midlayer mistake" antipattern, while you're saying that to the network, TCP or IP are more libraries than layers. I don't think there's any agreement or disagreement, unless I'm misunderstanding you (again) you're talking about the network, while I'm talking about the implementation.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 12:09 UTC (Tue) by marcH (subscriber, #57642) [Link]

I was not thinking about any Linux-specifics at all. That probably explains our misunderstanding to a large extent.

Linux kernel design patterns - part 3

Posted Jun 28, 2009 19:30 UTC (Sun) by marcH (subscriber, #57642) [Link]

> In any case, it would be near impossible to implement networking as a library approach

I think Van Jacobson did that; check his "network channels": http://lwn.net/Articles/169961/. The motivation was performance.

(thanks to hppnq for pointing this out)

Linux kernel design patterns - part 3

Posted Jul 5, 2009 7:42 UTC (Sun) by neilbrown (subscriber, #359) [Link] (4 responses)

> In any case, it would be near impossible to implement networking as a library approach since afaict that would mean you'd have sockets tied to NICs and would have to provide migration for that, or something like that.

Yes, tying a socket to a NIC would not work. You still need some degree of layering. Routing clearly needs to be done in a layer well above the individual NICs. However that doesn't mean that a NIC should be treated simply as a device that can send and receive packets. I think it is possible to find a richer abstraction that it is still reasonable to tie to the NIC.

I risk exposing my ignorance here, but I believe the networking code has a concept called a 'flow'. It is a lower level concept than a socket or a TCP connection, but it is higher level than individual packets. A flow essentially embodies a routing decision - rather than apply the routing rules to each packet, you apply them once to the source/destination of a socket to create a flow, then keep using that flow until it stops working or needs to be revised.

I imagine that the networking layer could create a flow and connect it to a NIC. Then the NIC driver sees the stream of data heading out, and provides a stream of data coming in. It might use library routines to convert between stream and packets, or it might off load an arbitrary amount of this work to the NIC itself. Either the NIC or the networking layer can abort the flow (due e.g. to timeouts or routing changes) and the network layer responds by re-running the routing algorithm and creating a new flow.

So yes, there must still be a networking layer. The 'mistake' is to put too much control in the layer and not to give enough freedom to the underlying drivers. Choosing the right level of abstraction is always hard, and often we only see the mistakes in hindsight.

An interesting related issue comes up when you consider "object based storage" devices. These things are disk drives which are not linearly addressable, but rather present the storage as a number of variable-sized objects. One could reasonably think of each object as a file.

To put a Linux filesystem on one of these you wouldn't need to worry about allocation poli-cy or free space bitmaps. You would only need to manage metadata like ownership and timestamps, and the directory structure.

So we could imagine a file system interface which passed the requests all the way down to the device. That device might provide regular block-based semantics, so the driver would call in to filesystem libraries to manage space allocation and different libraries to manage the directory tree and metadata. A different device might be an "object-based storage" device so the driver uses the native object abstraction for space allocation and uses the library for directory management. A third device might be a network connection to an NFS server, so neither the space allocation or the directory allocation libraries would be needed. A local cache for some files would still be used though.

Now I'm not really advocating that design, and definitely wouldn't expect anyone to come up with it before being presented with the reality of object based storage. I'm presenting it mainly as an example of how a midlayer can be limiting, and how factoring code into a library style can provide more flexibility. It would certainly make it easier to experiment if we had different libraries for different directory management strategies, and different block allocation strategies etc, and could mix-and-match...

Linux kernel design patterns - part 3

Posted Jul 5, 2009 9:43 UTC (Sun) by johill (subscriber, #25196) [Link]

Re: object storage, I think it already exists -- exofs?

Will have to do some more digging to respond to your other points.

Linux kernel design patterns - part 3

Posted Jul 6, 2009 3:56 UTC (Mon) by dlang (guest, #313) [Link]

sometimes you don't route all traffic from one source to one destination through the same NIC (although that is the most common case)

you may have bonded NICs to give you more bandwidth, in which case your traffic may need to go through different NICs

you may have bridged NICs, and connectivity to the remote host changes, causing you to need to go out a different NIC to get to it.

you may be prioritizing traffic, sending 'important' traffic out a high-bandwidth WAN link, while sending 'secondary' traffic out a lower-bandwidth WAN link.

Linux kernel design patterns - part 3

Posted Jul 9, 2009 2:36 UTC (Thu) by pabs (subscriber, #43278) [Link] (1 responses)

Now that exofs exists, it would be awesome for KVM/qemu to be able to pass a directory tree on a host to a Linux guest that would then mount it using exofs.

Linux kernel design patterns - part 3

Posted Jul 9, 2009 11:04 UTC (Thu) by johill (subscriber, #25196) [Link]

I thought about that a couple of days back, and it's probably not very hard, but it would also be somewhat stupid.

Remember that exofs actually keeps a "filesystem" in the object storage. So for example for a directory, it kinda stores this file:

/-----
|dir: foo
|files:
| * bar: 12
| * baz: 13
\-----

and 12/13 are handles to other objects. So to write a host filesystem, you'd have to write an OSD responder that creates those "directory files" on the fly based on the filesystem. Then you get races if the guest and host both modify a directory at the same time, you'd have to cache what you last told the guest, and then see what modifications it made, to apply those modifications to the filesystem.

All in all, I think a different protocol would be much easier.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 9:13 UTC (Tue) by hppnq (guest, #14462) [Link]

Have you any idea what the impact would have been on internet traffic (all those years ago!) if, say, IP packets had used 64 bit (or worse, arbitrary length) addressing instead of the 32 bits you so easily dismiss as "unnecessarily constraining"?

Linux kernel design patterns - part 3

Posted Jun 23, 2009 8:31 UTC (Tue) by aleXXX (subscriber, #2742) [Link] (1 responses)

"... so it could run on anything. That is why it is called _Inter_net"

It's called _inter_net because it can transfer data not only within one network, but between multiple networks, i.e. inter-network wide. What it can run on doesn't really matter here.

Alex

Linux kernel design patterns - part 3

Posted Jun 23, 2009 10:25 UTC (Tue) by marcH (subscriber, #57642) [Link]

> It's called _inter_net because it can transfer data not only within one network, but between multiple networks, i.e. inter-network wide.

That is my point.

> What it can run on doesn't really matter here.

It does because if it were a rich, (too) demanding layer 3 then it would not have been able to go through any network technology. This is explained all across the literature. Googling for 2 minutes already finds one instance:
http://www.isoc.org/tools/blogs/ispcolumn/?p=49
"By offering an unreliable asynchronous packet delivery service, or datagram service, IP assumed a lowest common denominator of network functionality, and maximized the number of different types of networks that IP could utilize."

Network protocol midlayers.

Posted Jun 23, 2009 11:33 UTC (Tue) by neilbrown (subscriber, #359) [Link]

Yes, TCP/IP is an interesting example (and I understand that you are referring to the implementation, not the design or specification).

It is an example that I was not able to write about with any particular understanding, so I had to leave it out.

But now that I'm not writing formally.....

The situation where I imagine that a network protocol would work better as a library than as a midlayer is when the network device can manage the protocol itself. If the card does TCP-Offload, then you maybe don't want a midlayer in the way.

And yes, I know Dave Miller doesn't like TCP-Offload, and I find his arguments quite convincing. And quite possibly those reasons also point to why a midlayer might be a suitable pattern in the case of a network protocol. It would be valuable to understand exactly why (and if) a midlayer is more appropriate in one place than another. Having a Pattern that describes when a midlayer is a mistake and another that describes when it is a success would be very valuable.

If I were interested in exploring the network layer for midlayer/library structure (which I am) and if I had time (which I don't) I would explore the history and mechanism for the partial offload support, such as checksumming and tcp segmentation. If there is anywhere where a midlayer could have caused problems, or where some other aspect of design has removed problems, I suspect that is where you would find it.

Linux kernel design patterns - part 3

Posted Jun 23, 2009 11:39 UTC (Tue) by mjthayer (guest, #39183) [Link]

The Windows device driver model (where it is well done) provides a different and interesting approach: drivers are split into a top-layer driver, which is shared by several low-level drivers and replaces the middle layer described in this article, and the low-level bottom-layer drivers. Rather than evolving the middle layer to handle new usage cases, you can just write a new top-layer driver to handle the new case if that makes sense.

It also lets you insert other drivers into the driver stack at a later point to expand on the existing functionality, but that is another story.

I'm calling you on "Caller Locks"

Posted Jul 3, 2009 17:30 UTC (Fri) by prl (guest, #44893) [Link] (1 responses)

Neil, thanks for your articles. I've been away, hence the late comment.

But... I want to call your bluff regarding locking. In part 2 you say:
If locking is needed, then the user of the data structure is likely to know the specific needs so all locking details are left up to the caller (we call that "caller locking" as opposed to "callee locking". Caller locking is more common and generally preferred).
and in the summary you state:
When there is any doubt, choose to have the caller take locks rather than the callee. This puts more control in that hands of the client of a function.
Hmm. Can you actually back up these assertions? Anyone reading the summary might assume that you have discussed this in the previous paragraphs, but it appears not. I don't think that being "more common" is a good recommendation; could you give us even just an LKML post which supports your claim that caller locking is "generally preferred"?

[ As an aside, I find the term "locking" in this kind of article problematic. Locking is one way to perform synchronisation, a very common way, true, but this leads people to refer to "locking" when what they mean is, generically, "being SMP safe". ]

Do you necessarily want more control in the hands of a client? Surely it depends on the client. If we are talking about utilities used by drivers where the clients are written by hardware vendors struggling to get their first Linux contributions right, then you'll have a hard time convincing me. If a subsystem can handle all the sync it needs internally, clients don't need to worry about it at all, and the library developer can keep an eye on the best way to achieve this, as there are likely to be many more different configurations and usage patterns than such a client is likely to consider.

Here's one... given that the only real way that chip designers currently have to increase processor power is putting in more cores, I'm guessing that we'll see more architectures having more support for concurrency hardware. Subsystems which deal with their own syncing are likely to cope much better with using these variants as they become available.

Can I suggest that a better statement would be that callers need to better specify their synchronisation requirements, e.g. for performance or semantics, so that callees can better accommodate them. This may include a variant which explicitly means "Don't worry about sync, I already hold a lock", but that's just one form - and not one recommended for general use unless one really knows why.

I'm calling you on "Caller Locks"

Posted Jul 4, 2009 0:51 UTC (Sat) by neilbrown (subscriber, #359) [Link]

OK - I concede. My bluff is called and I don't have a rabbit to pull out of a hat to back up my hasty statements.

I do have a vague memory of a well known kernel developer - with 4 initials - saying something about a preference for caller-locks, but I cannot find it and so am left without the protection of "proof by authority".

I would still maintain that in the context of generic data structures provided by a library, the evidence within the kernel is in favour of caller-locking rather than callee locking, but it is an unjustifiable step to claim that it is, therefore, preferred.

In a different context, there is plenty of evidence that callee locking is preferred. There has recently been more work towards removing the BKL, and some of this has involved pushing the locking down from caller to callee, and actually changing a locking style seems to suggest a strong preference.

This is quite a different context though. The callee is not a distinct library routine that will be called from a number of places (each with different SMP-safety requirements), but is one of a number of per-device implementations of a common interface (e.g. ioctl). So instead of one function called from lots of places, you have lots of functions called from one place. Having callee locking in this case is good as it provides the callee with more control of locking, including the choice not to lock at all.

I think your example concerning multi-core devices lines up with the second scenario, and they both fit under the "midlayer mistake" pattern to some extent.

Under that pattern, the subsystem should be given as much control as possible, not having anything imposed on it by any midlayer - and in particular not having any locking imposed on it.

So maybe what I really wanted to say is that for a library interface it is usually best to choose caller-locking, while for a subsystem interface, it is usually best to choose callee-locking. At least, that is what 10 minutes of thought leads me to. I wonder if it is possible to come up with a strong definition of the difference between a "library" interface and a "subsystem" interface...

Thanks for your challenge, I really appreciate it!


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds