The managed resource API
The core idea behind the resource management interface is that remembering to free allocated resources is hard. It appears to be especially hard for driver writers who, justly or not, have a reputation for adding more than their fair share of bugs to the kernel. And even the best driver writers can run into trouble in situations where device probing fails halfway through; the recovery paths may be there in the code, but they tend not to be well tested. The result of all this is a fair number of resource leaks in driver code.
To address this problem, Tejun Heo created a new set of resource allocation functions which track allocations made by the driver. These allocations are associated with the device structure; when the driver detaches from the device, any left-over allocations are cleaned up. The resource management interface is thus similar to the talloc() API used by the Samba hackers, but it is adapted to the kernel environment and covers more than just memory allocations.
Starting with memory allocations, though, the new API is:
    void *devm_kzalloc(struct device *dev, size_t size, gfp_t gfp);
    void devm_kfree(struct device *dev, void *p);
In a pattern we'll see repeated below, the new functions are similar to kzalloc() and kfree() except for the new names and the addition of the dev argument. That argument is necessary for the resource management code to know when the memory can be freed. If any memory allocations are still outstanding when the associated device is removed, they will all be freed at that time.
Note that there is no managed equivalent to kmalloc(); if driver writers cannot be trusted to free memory, it seems, they cannot be trusted to initialize it either. There are also no managed versions of the page-level or slab allocation functions.
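As a rough sketch of how this changes driver code (the foo_ names and the private structure here are invented for illustration, not taken from any real driver), a probe() routine using the managed allocator needs no matching free call in its error or cleanup paths:

    #include <linux/device.h>
    #include <linux/slab.h>

    /* Hypothetical per-device state, for illustration only. */
    struct foo_priv {
        void __iomem *regs;
        int irq;
    };

    static int foo_probe(struct device *dev)
    {
        struct foo_priv *priv;

        /* Freed automatically when the driver detaches from dev. */
        priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
        if (!priv)
            return -ENOMEM;

        dev_set_drvdata(dev, priv);
        return 0;   /* no kfree() needed on any later error path either */
    }

If a later probe step fails, simply returning the error is enough; the allocation is cleaned up along with everything else attached to the device.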
Managed versions of a subset of the DMA allocation functions have been provided:
    void *dmam_alloc_coherent(struct device *dev, size_t size,
                              dma_addr_t *dma_handle, gfp_t gfp);
    void dmam_free_coherent(struct device *dev, size_t size, void *vaddr,
                            dma_addr_t dma_handle);
    void *dmam_alloc_noncoherent(struct device *dev, size_t size,
                                 dma_addr_t *dma_handle, gfp_t gfp);
    void dmam_free_noncoherent(struct device *dev, size_t size, void *vaddr,
                               dma_addr_t dma_handle);
    int dmam_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr,
                                     dma_addr_t device_addr, size_t size,
                                     int flags);
    void dmam_release_declared_memory(struct device *dev);
    struct dma_pool *dmam_pool_create(const char *name, struct device *dev,
                                      size_t size, size_t align,
                                      size_t allocation);
    void dmam_pool_destroy(struct dma_pool *pool);
All of these functions have the same arguments and functionality as their dma_* equivalents, but they will clean up the DMA areas on device shutdown. One still has to hope that the driver has ensured that no DMA remains active on those areas, or unpleasant things could happen.
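A minimal sketch (again with invented foo_ names and an arbitrary buffer size) of a coherent buffer allocated this way:

    #include <linux/device.h>
    #include <linux/dma-mapping.h>

    #define FOO_RING_BYTES 4096   /* arbitrary size for the example */

    static int foo_setup_ring(struct device *dev)
    {
        dma_addr_t ring_dma;
        void *ring;

        /* Released at driver detach; no explicit dmam_free_coherent() here. */
        ring = dmam_alloc_coherent(dev, FOO_RING_BYTES, &ring_dma, GFP_KERNEL);
        if (!ring)
            return -ENOMEM;

        /* ... tell the (hypothetical) device about ring_dma ... */
        return 0;
    }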
There is a managed version of pci_enable_device():
    int pcim_enable_device(struct pci_dev *pdev);
There is no pcim_disable_device(), however; code should just use pci_disable_device() as usual. A new function:
    void pcim_pin_device(struct pci_dev *pdev);
will cause the given pdev to be left enabled even after the driver detaches from it.
The patch makes the allocation of I/O memory regions with pci_request_region() managed by default - there is no pcim_ version of that interface. The higher-level allocation and mapping interfaces do have managed versions:
    void __iomem *pcim_iomap(struct pci_dev *pdev, int bar,
                             unsigned long maxlen);
    void pcim_iounmap(struct pci_dev *pdev, void __iomem *addr);
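Put together, a PCI probe() function might look like the following sketch; the "foo" name is made up, and error handling consists simply of returning, since the managed layer undoes whatever has been done so far:

    #include <linux/pci.h>

    static int foo_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        void __iomem *regs;
        int ret;

        /* The enable is undone automatically when the driver detaches. */
        ret = pcim_enable_device(pdev);
        if (ret)
            return ret;

        /* The region, too, is released automatically. */
        ret = pci_request_region(pdev, 0, "foo");
        if (ret)
            return ret;

        /* Map BAR 0 (maxlen of zero means "the whole BAR"); the mapping
           is torn down at detach time as well. */
        regs = pcim_iomap(pdev, 0, 0);
        if (!regs)
            return -ENOMEM;

        /* ... the rest of device setup ... */
        return 0;
    }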
For the allocation of interrupts, the managed API is:
    int devm_request_irq(struct device *dev, unsigned int irq,
                         irq_handler_t handler, unsigned long irqflags,
                         const char *devname, void *dev_id);
    void devm_free_irq(struct device *dev, unsigned int irq, void *dev_id);
For these functions, the addition of a struct device argument was required.
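As a hedged example (the handler and names are invented), requesting an interrupt that will be freed automatically looks like:

    #include <linux/interrupt.h>

    static irqreturn_t foo_interrupt(int irq, void *dev_id)
    {
        /* ... check and handle the (hypothetical) device ... */
        return IRQ_HANDLED;
    }

    static int foo_setup_irq(struct device *dev, unsigned int irq, void *priv)
    {
        /* Freed automatically at driver detach; no devm_free_irq() needed. */
        return devm_request_irq(dev, irq, foo_interrupt, 0, "foo", priv);
    }

One caveat worth noting: since the interrupt is released only after the driver has detached, the handler can still run late in the teardown sequence, so the driver should have quieted the device by then, much like the DMA caveat above.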
There is a new set of functions for the mapping of I/O ports and memory:
    void __iomem *devm_ioport_map(struct device *dev, unsigned long port,
                                  unsigned int nr);
    void devm_ioport_unmap(struct device *dev, void __iomem *addr);
    void __iomem *devm_ioremap(struct device *dev, unsigned long offset,
                               unsigned long size);
    void __iomem *devm_ioremap_nocache(struct device *dev,
                                       unsigned long offset,
                                       unsigned long size);
    void devm_iounmap(struct device *dev, void __iomem *addr);
Once again, these functions required the addition of a struct device argument for the managed form.
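For instance (the port number and names are purely illustrative), a legacy port range could be mapped like this:

    #include <linux/device.h>
    #include <linux/io.h>

    static int foo_map_ports(struct device *dev)
    {
        void __iomem *base;

        /* 0x3f8/8 is just a familiar legacy UART range, used as an example. */
        base = devm_ioport_map(dev, 0x3f8, 8);
        if (!base)
            return -ENOMEM;

        /* Accesses then go through ioread8()/iowrite8() as usual; the
           mapping is dropped automatically at driver detach. */
        return 0;
    }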
Finally, for those using the low-level resource allocation functions, the managed versions are:
    struct resource *devm_request_region(struct device *dev,
                                         resource_size_t start,
                                         resource_size_t n,
                                         const char *name);
    void devm_release_region(resource_size_t start, resource_size_t n);
    struct resource *devm_request_mem_region(struct device *dev,
                                             resource_size_t start,
                                             resource_size_t n,
                                             const char *name);
    void devm_release_mem_region(resource_size_t start, resource_size_t n);
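A platform-style probe path (sketched here with an invented foo driver) would combine these with the mapping functions above:

    #include <linux/platform_device.h>
    #include <linux/ioport.h>
    #include <linux/io.h>

    static int foo_plat_probe(struct platform_device *pdev)
    {
        struct device *dev = &pdev->dev;
        struct resource *res;
        void __iomem *regs;

        res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
        if (!res)
            return -ENODEV;

        /* Both the region and the mapping go away at driver detach. */
        if (!devm_request_mem_region(dev, res->start,
                                     res->end - res->start + 1, "foo"))
            return -EBUSY;

        regs = devm_ioremap(dev, res->start, res->end - res->start + 1);
        if (!regs)
            return -ENOMEM;

        /* ... */
        return 0;
    }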
The resource management layer includes a "group" mechanism, accessed via these functions:
    void *devres_open_group(struct device *dev, void *id, gfp_t gfp);
    void devres_close_group(struct device *dev, void *id);
    void devres_remove_group(struct device *dev, void *id);
    int devres_release_group(struct device *dev, void *id);
A group can be thought of as a marker in the list of allocations associated with a given device. Groups are created with devres_open_group(), which can be passed an id value to identify the group or NULL to have the ID generated on the fly; either way, the resulting group ID is returned. A call to devres_close_group() marks the end of a given group. Calling devres_remove_group() causes the system to forget about the given group, but does nothing with the resources allocated within the group. To remove the group and immediately free all resources allocated within that group, devres_release_group() should be used.
The group functions seem to be primarily aimed at mid-level code - the bus layers, for example. When bus code tries to attach a driver to a device, it can open a group; should the attach fail, the group can be used to free up any resources allocated by the driver.
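As a sketch of that usage (foo_call_probe() is a hypothetical stand-in for whatever the bus layer does to invoke the driver's probe routine):

    #include <linux/device.h>

    static int foo_attach(struct device *dev)
    {
        void *group;
        int ret;

        /* Mark the start of everything the driver is about to allocate. */
        group = devres_open_group(dev, NULL, GFP_KERNEL);
        if (!group)
            return -ENOMEM;

        ret = foo_call_probe(dev);    /* hypothetical probe invocation */
        if (ret) {
            /* Probe failed: free everything it allocated. */
            devres_release_group(dev, group);
            return ret;
        }

        /* Success: keep the allocations, just forget the marker. */
        devres_remove_group(dev, group);
        return 0;
    }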
There are not many users of this new API in the kernel now. That may change over time as driver writers become aware of these functions and, perhaps, as the list of managed allocation types grows. The reward for switching over to managed allocations should be more robust and simpler code as current failure and cleanup paths are removed.
Posted Feb 22, 2007 15:19 UTC (Thu) by nix (subscriber, #2304):

It seems odd that an ease-of-use interface like the pcim_*() interface provides only partial coverage of the interfaces it's supposed to replace. Users are supposed to remember to use e.g. pci_disable_device(), but if they do the orthogonal thing and use pci_enable_device() as well, they've just introduced a bug.

Surely the sensible thing to do is to have an inline wrapper that introduces pcim_disable_device() and just thunks to pci_disable_device() (and so on for the other non-covered functions in this set)? This wouldn't introduce any extra overhead, and it would increase consistency.

Posted Feb 22, 2007 19:47 UTC (Thu) by jospoortvliet (guest, #33164):

I wonder if this has any performance impact?

Posted Feb 22, 2007 22:54 UTC (Thu) by nlucas (guest, #33793):

I have not seen the final patch, but it should only affect the initialization and cleanup phases, so at most it would make module loading/unloading slower.

I guess that if one doesn't load all kernel modules at once (and most distros with udev don't), the effect will not be measurable.

Posted Feb 23, 2007 23:28 UTC (Fri) by flewellyn (subscriber, #5047):

So we've come incrementally closer to having a garbage collector in the kernel, I see. Why they don't just bite the bullet and DO it, I have no idea.

Posted Feb 24, 2007 10:11 UTC (Sat) by k8to (guest, #15413):

If you consider a reference counter to be a type of garbage collector, it seems they are slowly doing it. I don't really expect a heavier garbage collector to appear anytime soon, however.

Posted Mar 1, 2007 8:40 UTC (Thu) by rwmj (subscriber, #5474):

The problem is that reference counting is the heaviest garbage collector there is. Real, well-implemented GCs[1] are much better than reference counting.

Rich.

[1] And they are rare. Java and Emacs have really bad GCs, but because everyone is familiar with them, they assume that GC itself is bad.

Posted Mar 1, 2007 18:40 UTC (Thu) by quintesse (guest, #14569):

Java has a bad GC?? Do you have any basis for that statement? I was under the impression that it was doing a pretty good job. Sun engineers are working constantly to improve Java's GC, often not just with incremental changes but also by including completely new algorithms. So I'm sure they would be delighted if someone could point out to them that there are far superior GCs out there!

Posted Mar 1, 2007 19:57 UTC (Thu) by rwmj (subscriber, #5474):

That was a bit of a dig at Java. There are a number of architectural problems with the Java runtime, which is why it has taken an enormous effort to get something that is still (IME) pretty sluggish except under very special conditions (long-running server-side processes).

Compare that to the OCaml GC. OCaml is maintained by a small team of developers and regularly performs close to, and often faster than, equivalent C programs. They made a lot of very smart choices with the runtime representations of values, and it helps that they have a couple of world-leading GC academics writing the code.

Rich.

Posted Mar 1, 2007 21:29 UTC (Thu) by massimiliano (subscriber, #3048):

I'm interested! Any link to a "readable" paper with a comparison of their approaches, or anyway some more detailed information?

Posted Mar 1, 2007 21:55 UTC (Thu) by rwmj (subscriber, #5474):

There is a lot of detail about the guts here, written by one of the aforementioned experts: http://cristal.inria.fr/~doligez/caml-guts/

I'm afraid that I don't know of any readable introduction to the OCaml internals. However, if you are really interested then I'd recommend you read this chapter from the manual: http://caml.inria.fr/pub/docs/manual-ocaml/manual032.html

It took me about a year of on-again, off-again reading to understand exactly how clever the runtime representation of types described in that chapter is.

For a general introduction to OCaml, see http://www.ocaml-tutorial.org/

Rich.

Posted Mar 1, 2007 22:10 UTC (Thu) by joib (subscriber, #8541):

Then again, the OCaml runtime is single-threaded, and the usual response to requests to make it multithreaded is that it would make the GC massively slower. Or, to be precise, multiple threads are supported, but there is a global lock so that only one thread may be executing OCaml code at a time. See e.g. http://caml.inria.fr/pub/ml-archives/caml-list/2002/11/64...

I mean, GC would certainly be nice, but while I don't have any GC implementation experience, I'd think the effort to write a concurrent GC that scales up to 2048-way NUMA machines (or whatever the biggest single-image system Linux runs on at the moment is) is far from trivial.

Posted Mar 2, 2007 9:07 UTC (Fri) by rwmj (subscriber, #5474):

There was a concurrent Caml Light, as shown in the link you posted.

My problem is not that the kernel devs have considered and rejected garbage collection, but that they're implementing a really bad form of garbage collection without any rational investigation of the field.

Rich.

Posted Feb 25, 2007 12:13 UTC (Sun) by liljencrantz (guest, #28458):

I disagree. A garbage collector is a "fire and forget" memory allocation strategy. What is described here is a memory allocator that groups allocations together so that you can tell the system "when this piece of memory here is recycled, then these pieces over here are no longer needed either". That is sometimes extremely useful, since it turns out that allocation time is often a time when you have pretty detailed knowledge of when something will die.

Posted Mar 1, 2007 8:44 UTC (Thu) by rwmj (subscriber, #5474):

> A garbage collector is a "fire and forget" memory allocation strategy. What is described here is a memory allocator that groups allocations together so that you can tell the system "when this piece of memory here is recycled, then these pieces over here are no longer needed either".

That shows a poor understanding of garbage collectors. In fact a GC has precisely this information - that groups of memory are related, and that when one piece of memory is no longer reachable, other memory allocations become unreachable too. This is encoded implicitly in the relationships of pointers between memory areas. All that is happening here is that the kernel devs have made this relationship explicit, inefficiently.

Rich.

Posted Feb 26, 2007 16:53 UTC (Mon) by nix (subscriber, #2304):

There already is at least one (in net/unix/Space.c, IIRC). If you don't garbage-collect fds in transit down AF_UNIX sockets, you open a trivial DoS hole. (But that's a very special-purpose GC, so it may not count.)