Drivers as documentation
One might argue that the job of documenting the hardware falls on whoever writes the associated datasheet. There is some truth to that claim, but, in many cases, only the original author of the driver has access to that datasheet. Those who come after can try to extract documentation from the vendor or to search for clandestine copies hosted on the net. But often the only option is to figure out the hardware from the one source of information that is actually available: the existing driver. If the driver source does not help that new developer, one can argue that the original author has fallen down on the job.
So, if a driver contains code like:
writel(devp->regs[42], 0xf4ee0815);
it is missing something important. In the absence of the datasheet, there is no way for any other developer to have any clue of what that operation is actually doing.
The problem is worse than that, though; datasheets often omit useful information, obscure the truth, and lie through their teeth. The hardest part of getting a driver to work is often the process of figuring out what the hardware's features and special needs really are. It often seems, for example, that the datasheet is written before the process of designing the hardware begins. As time passes, the understanding of the problem grows, and deadlines loom, hardware engineers start to jettison features that cannot be made to work in time or that, in their sole and not-subject-to-appeal opinion, can be painlessly fixed in software. Updating the datasheet to match the actual hardware never happens.
Thoughtful driver developers will, on discovery of the imaginary nature of a specific hardware feature, add a comment to the driver; that way, no future maintainer has to figure out (the hard way, involving keyboard imprints on the forehead) why the driver does not use a specific, helpful-looking hardware capability.
Then there is the matter of "reserved" bits. There has not yet been a datasheet written that did not contain entries like:
Weird tangential functions register (offset 0xc8)
Bits | Function |
---|---|
17 | Reserved: do not touch this bit or the terrorists will win |
Somewhere, deep within the company, there will be a maximum of two engineers who know that the document is incomplete and that nobody has ever gotten around to updating it. If you can corner one of those people, you can usually get them to admit that this bit should be documented as:
Weird tangential functions register (offset 0xc8)
Bits | Function |
---|---|
17 | 0 = DMA engine randomly locks up; 1 = DMA engine functions as expected; default value = 0 |
A developer who cannot get their hands within range of the neck of at least one of those hardware engineers will likely spend a lot of time figuring out that they need to set the "make it work" bit. This effort can involve reverse-engineering proprietary drivers or, in cases of pure desperation, playing with random bits to see what changes. Once that bit has been located, it is natural for the tired and frustrated developer to quietly set the bit before heading off in a determined effort to eliminate the memory of the entire process through the application of large amounts of beer. A particularly forward-thinking developer might make a note on a printed version of the datasheet for future reference.
But handwritten notes are not usually helpful to the next developer who has to work on that driver. A moment spent documenting that bit:
#define WTF_PRETTY_PLEASE 0x00020000 /* Always set this or it locks up */
may save somebody else hours of unnecessary pain.
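As a rough illustration of how such a definition might be used, here is a minimal sketch of a read-modify-write on that register. The names WTF_REG, fake_regs, reg_read(), reg_write(), and wtf_enable_working_dma() are invented for this example; a real driver would map the device and use readl() and writel() rather than a plain array.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for a memory-mapped register window (an assumption made so the
 * sketch builds in user space); a real driver would use ioremap() plus
 * readl()/writel().
 */
static uint32_t fake_regs[64];

static uint32_t reg_read(uint32_t offset)
{
	assert(offset % 4 == 0 && offset / 4 < 64);
	return fake_regs[offset / 4];
}

static void reg_write(uint32_t value, uint32_t offset)
{
	assert(offset % 4 == 0 && offset / 4 < 64);
	fake_regs[offset / 4] = value;
}

/* Weird tangential functions register */
#define WTF_REG			0xc8
/*
 * Bit 17 is marked "reserved" in the datasheet, but the DMA engine
 * randomly locks up unless it is set. It resets to 0.
 */
#define WTF_PRETTY_PLEASE	0x00020000

/* Set the undocumented bit without disturbing the rest of the register. */
static void wtf_enable_working_dma(void)
{
	uint32_t val = reg_read(WTF_REG);

	val |= WTF_PRETTY_PLEASE;
	reg_write(val, WTF_REG);
}
```

Keeping the explanation next to the definition means that anyone who later greps for the constant finds the reason along with the value.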
It is tempting to think of a completed driver as being done. But driver code, like other kernel code, is subject to ongoing change. Kernel API changes must be dealt with, problems need to be fixed, and newer versions of the hardware must be supported. Depending on how much beer was involved, the original author may remember that device's peculiarities, but those who follow will not. Everybody would be better served if the driver not only made the hardware work, but also helped the reader understand how the hardware works.
Doing so is not usually hard. Define descriptive names for registers, bits, and fields rather than putting in hard-coded constants. Note features that are incompletely described, incorrectly described, or entirely science-fictional. Comment operations that have non-obvious ordering requirements or that do not play well together. And, in general, code with a great deal of sympathy for the people who will have to make changes to your work in the future. Some hardware can never be properly documented because the relevant information is simply not available; see this 2006 article for an example. But what information is available should be made available to others.
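The practices listed above might look something like the following sketch. Everything here is hypothetical: the "foo" device, its register layout, the unimplemented checksum bit, and the ordering requirement are all invented to show the style, and the array stands in for real memory-mapped I/O.

```c
#include <stdint.h>

/* Hypothetical register layout, invented for illustration only. */
#define FOO_CTRL		0x00		/* Control register */
#define  FOO_CTRL_ENABLE	(1u << 0)	/* Start the device */
#define  FOO_CTRL_IRQ_EN	(1u << 1)	/* Unmask interrupts */
#define  FOO_CTRL_BURST_SHIFT	4		/* DMA burst length, bits 7:4 */
#define  FOO_CTRL_BURST_MASK	(0xfu << FOO_CTRL_BURST_SHIFT)
/*
 * Bit 8 is described in the (imaginary) datasheet as "hardware checksum
 * offload". Assume here that it was never implemented: never set it.
 */
#define  FOO_CTRL_CSUM_BROKEN	(1u << 8)

static uint32_t foo_regs[16];	/* stand-in for the MMIO window */

static uint32_t foo_read(uint32_t offset)
{
	return foo_regs[offset / 4];
}

static void foo_write(uint32_t value, uint32_t offset)
{
	foo_regs[offset / 4] = value;
}

static void foo_start(unsigned int burst)
{
	uint32_t ctrl = foo_read(FOO_CTRL);

	ctrl &= ~(FOO_CTRL_BURST_MASK | FOO_CTRL_CSUM_BROKEN);
	ctrl |= (burst << FOO_CTRL_BURST_SHIFT) & FOO_CTRL_BURST_MASK;
	/*
	 * Ordering matters (by assumption in this sketch): the burst length
	 * must be programmed before ENABLE is set, hence two writes.
	 */
	foo_write(ctrl, FOO_CTRL);
	foo_write(ctrl | FOO_CTRL_ENABLE | FOO_CTRL_IRQ_EN, FOO_CTRL);
}
```

None of this costs more than a few minutes, and the two-write sequence no longer looks like an accident that a later cleanup patch should "fix".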
Core kernel hackers are occasionally heard to make dismissive remarks about driver developers and the work they do. But driver writers are often given a difficult task involving a fair amount of detective work; they get this task done and make our hardware work for us. Writing drivers that adequately document the hardware is not an unreasonable thing to ask of these developers; they have the hardware knowledge and the skills to do it. The harder problem may be asking driver reviewers to insist that this extra effort be made. Without pressure from reviewers, many drivers will never enable readers to really understand what is going on.
Drivers as documentation
Posted Nov 23, 2011 23:50 UTC (Wed)
by magnuson (subscriber, #5114)
As for the example, I can tell you precisely why it's documented that way. "DMA engine randomly locks up" is a hardware bug that customers may demand be fixed at great expense. A typical full layer spin for an IC will run you about $500,000 these days, and that's if you push a lot of volume. "Reserved" is no problem at all. Just don't do that. Simple.
I've coded a few magic constants in my day, but they've been clearly labeled as such with at least an attempt to describe what they do. Mostly they are escape hatches in case of bugs elsewhere... In any case my neck remains un-wrung, so I must be doing something right.
Drivers as documentation
Posted Nov 24, 2011 11:57 UTC (Thu)
by gnb (subscriber, #5132)
Ideally yes, but in the example given the bit defaults to the wrong value and needs to be set in defiance of the datasheet. This does happen.
Drivers as documentation
Posted Nov 25, 2011 5:56 UTC (Fri)
by jzbiciak (guest, #5246)
So it doesn't surprise me at all that there might not be reasonable customer-level documentation for some of these "reserved" bits. At the same time, I can also see that workarounds for flaky designs ("Oops, the frobnitz accelerator sometimes frotzes when it should gronk, if two quux DMAs come in consecutive cycles") might rely on these "reserved" bits. The bit in the article above might be documented internally as "disable frobnitz acceleration (internal test purposes only; do not use for normal operation)".
Other times, it's due to the same peripheral getting used in different configurations on different chips, and the field in question should be "irrelevant" for this particular chip. So the bit exists and the feature it controls exists, but the feature isn't necessarily useful on this chip, or wasn't spec'd to be on this chip, and so it might not be tested. The fact that enabling it anyway stops DMAs from randomly crashing might be a happy accident.
In my day job, I'm at the head of one of these pipelines, writing specs that lead to the hardware and later to the customer documentation. I also had some good chuckles reading through this article. From what I've seen, the process of turning my specs into end customer documentation involves a lot of deleting (missing some of the internal implementation details, but invariably deleting some important detail customers need), and inserting several *ahem* interesting grammatical twists and confusing diagrams. I don't envy the folks that have to make the end-customer documentation from my specs, but sausage making is sausage making, no matter who makes it.
Drivers as documentation
Posted Dec 1, 2011 16:41 UTC (Thu)
by LeftCoastDave (guest, #81645)
This article hit very close to home, especially the part about DMA randomly locking up. I previously spent a lot of time on a deeply embedded firmware design where I would scan all DMA operations looking for certain lockup conditions, then abort the DMA hardware operation and replace the operation by a firmware memcpy.
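A workaround of that general shape might be sketched as follows. This is not the firmware described above: the lockup condition (an unaligned source or a length that is not a multiple of four), the helper names, and the counters are all assumptions made for illustration, and the check is done before starting the transfer rather than by aborting one in flight.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counters so the sketch can show which path was taken. */
static unsigned int dma_count, cpu_count;

/*
 * Invented lockup condition: pretend the engine hangs on a source address
 * or length that is not a multiple of 4 bytes. The real conditions would
 * have to come from the hardware team or from painful experiment.
 */
static bool dma_would_lock_up(const void *src, size_t len)
{
	return ((uintptr_t)src % 4) != 0 || (len % 4) != 0;
}

/* Stand-in for programming a real DMA engine and waiting for completion. */
static void dma_start_and_wait(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	dma_count++;
}

/* Use DMA when it is known to be safe; otherwise copy with the CPU. */
static void safe_copy(void *dst, const void *src, size_t len)
{
	if (dma_would_lock_up(src, len)) {
		memcpy(dst, src, len);
		cpu_count++;
		return;
	}
	dma_start_and_wait(dst, src, len);
}
```

Whatever the real conditions are, the comment on the check is the only place a later maintainer will learn why the slow path exists.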
I believe the issue with digital HW engineers comes from their work environment. Here are a few reasons:
1) Their designs tend to be quite in depth, and it is difficult to read verilog/vhdl code, so the code is heavily supported by word docs with block diagrams. This of course results in code slowly migrating away from documentation, and without the review process only the designer of the module knows how it works. They've done this for so long, that they believe this is the only way to do development.
2) Because of (1), brick walls form around modules, and only black box testing is done by simulations. Now no one looks at your code, no one appreciates being told that their code isn't up to par for readability, so they get very defensive and don't like to admit any fault internal to their design.
3) On top of all that chip schedules are always rushed, but they have no easy ability to go back and fix small mistakes, so they try and hide them under the carpet and blame firmware, or expect firmware to just figure out how to make it work.
4) It takes months for a chip to come back from the fab plant, months more for firmware to finish a dev. kit for it and start shipping it out to customers, and months more before a customer finds a bug. A year turnaround is normal; longer is not uncommon. Digital HW engineers are usually onto another project by that time.
Now every time a bug is found there is a knee-jerk reaction to blame firmware, even when a firmware team approaches the problem as "we don't know where the bug is, but we need help locating it". The HW engineer just doesn't want to open up that can of worms, which they know was just a rush job. Excuses like "We aren't going to re-spin the chip for this bug, so why should I spend time looking into it" are the norm. This really doesn't help firmware implement a workaround when the firmware team doesn't know the root cause.
I believe the only way this can change is from a management viewpoint that digital designers aren't just designing a chip right now, but they need to support the chip for years, and that should be budgeted for right up front. Documentation including verilog/vhdl code comments can happen well after the chip has taped out. This needs to be a serious priority in any schedule.
Drivers as documentation
Posted Nov 24, 2011 10:30 UTC (Thu)
by wsa (guest, #52415)
I am mainly active in driver subsystems which are used for current SoCs, so more than one. In those subsystems, hardcoded values like the one you mentioned would hardly ever pass a review. Comments are also required every time the datasheet is wrong or unclear. Most reviewers are very aware of that and insist on that. The bigger problem here IMHO is that it is often very hard to see if a developer spent days on a simple writel(), because it is one of many writel() calls in the code. So, for a review which goes into that level of detail, a reviewer would have to study the datasheet at least as deeply as the driver author. Deeper would be even better. Given that usually around 60% of all changes per release go to drivers, this is unlikely to happen. Most subsystems are already short of reviewers, because the code affects fewer people. We have a scaling problem here, and sadly, this is not news. Spreading the word like this hopefully helps a little bit, although I'd think this should go more to driver developers than reviewers...
Drivers as documentation
Posted Nov 25, 2011 5:59 UTC (Fri)
by jzbiciak (guest, #5246)
Drivers as documentation
Posted Nov 25, 2011 10:10 UTC (Fri)
by wsa (guest, #52415)
Drivers as documentation
Posted Nov 24, 2011 13:21 UTC (Thu)
by juliank (guest, #45896)
It looks fairly out of place compared to the rest of the kernel, but since we don't have any documentation on that device at the moment, and implement the driver in the community, having the documentation makes it easier to understand the driver and fix bugs.
We also have some strange things, like having to add an udelay() somewhere because the device otherwise locks up, and we don't know why. NVIDIA's original Android driver did not have those things and worked. Maybe their documentation has information about this, but we don't have it (although it might be made public if we're lucky).
Drivers as documentation
Posted Nov 25, 2011 6:02 UTC (Fri)
by jzbiciak (guest, #5246)
Ok, this is somewhat offtopic, but it piqued my curiosity...
> We also have some strange things, like having to add an udelay() somewhere because the device otherwise locks up, and we don't know why
Memory ordering issue, and a barrier of some sort is required? Does the lockup happen on the same chip as the original usleep-less Android implementation?
Drivers as documentation
Posted Nov 25, 2011 14:44 UTC (Fri)
by juliank (guest, #45896)
> Does the lockup happen on the same chip as the original
> usleep-less Android implementation?
Seems I misremembered. It does not lock up, it just sends
incomplete messages. I added an udelay(100) in commit
de839b8f06bc5dd3f5037c4409a720cbb9bf21c3 [1] which seems to
prevent that.
[1] https://git.kernel.org/?p=linux/kernel/git/torvalds/linux...
Thanks for the memories :-)
Posted Nov 24, 2011 16:50 UTC (Thu)
by felixfix (subscriber, #242)
Hair pulling while it went on, but as progress was made, satisfaction grew, and the final victory was sweet indeed.
Thanks for the memories. A part of me hopes the engineers are never given so much time and incentive that the datasheets are always correct ...
Must Be One
Posted Nov 24, 2011 17:36 UTC (Thu)
by wildea01 (subscriber, #71011)
BIT: 20
DESCRIPTION: Must Be One (MBO). This bit must be set to 1 for normal device operation
TYPE: R/W
DEFAULT (reset value): 0
We do at least set it in Linux, but it's a bit cryptic:
#define HW_CFG_SF_ (0x00100000) /* R/W */
God knows what it's really doing.
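For comparison, a definition that carries the little that is known might look like the sketch below. The name HW_CFG_MBO and the helper hw_cfg_with_mbo() are hypothetical and are not what the Linux driver uses; the comment only restates the datasheet entry quoted above.

```c
#include <stdint.h>

/*
 * Bit 20 of the hardware configuration register. The datasheet calls it
 * "Must Be One (MBO)": it resets to 0 and must be set to 1 for normal
 * device operation. What it actually controls is not documented.
 * (HW_CFG_MBO is a hypothetical name chosen for this sketch.)
 */
#define HW_CFG_MBO	(1u << 20)

/* Return the register value with the must-be-one bit forced on. */
static uint32_t hw_cfg_with_mbo(uint32_t hw_cfg)
{
	return hw_cfg | HW_CFG_MBO;
}
```

Even "nobody knows what this does, but the datasheet insists on it" is more useful to the next reader than a bare "R/W".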
Drivers as documentation
Posted Nov 24, 2011 19:57 UTC (Thu)
by giraffedata (guest, #1954)
I have on multiple occasions written an entire manual for a piece of hardware as I reverse engineered it in trying to use it. The problems go beyond incorrect and flatly missing information, because there is also information that just isn't clear. There's a certain satisfaction in setting the world right by doing the job the hardware supplier failed to do.
But I don't see much value in agitating for device driver writers to do this work. The same reasons that drive a device maker not to provide adequate documentation also drive the driver writers.
Our salvation may be that the drivers, unlike the data sheets, are open source, so the next frustrated user might correct the problem.
Generating driver code from specification
Posted Nov 24, 2011 22:10 UTC (Thu)
by cpeterso (guest, #305)
Intel Labs is working on a project (with funding from Google) called Termite that generates driver code from a hardware specification language:
http://ertos.nicta.com.au/research/drivers/synthesis/
http://www.theregister.co.uk/2011/06/10/automatic_device_...
A sneaky benefit is that the spec DSL could be designed such that device manufacturers must better document how their hardware actually works. :)
Drivers as documentation
Posted Nov 27, 2011 5:52 UTC (Sun)
by neilbrown (subscriber, #359)
>
> writel(devp->regs[42], 0xf4ee0815);
>
> it is missing something important.
One could argue that one important thing this code fragment is missing is GPL compliance.
> The source code for a work means the preferred form of the work for
> making modifications to it.
You cannot make sensible modifications to the above. You need it to be combined with documentation. So if the datasheet were included with the source code it might be OK. But without the data sheet ... it seems little different to binary firmware blobs.
(I've just been looking at init_lb035q02_panel() in
drivers/video/omap2/displays/panel-lgphilips-lb035q02.c
and feel that something is definitely missing in the freedom I am being given to understand and modify this code).
Do we need a sub-tree of Documentation which contains lots of PDFs of data sheets???
Datasheets
Posted Nov 27, 2011 15:36 UTC (Sun)
by corbet (editor, #1)
The problem, of course, is that lots of driver developers get the datasheets under NDAs and cannot contribute them anywhere public. There are those who say that we simply should not develop drivers under such conditions, while others say it's better to have the driver with as much useful information crammed into it as possible.
GPL compliance seems like a hard claim to make, in any case. There is no other form of the code to be "the preferred form of the work for making modifications."