Statistics and tracepoints
Back in July, Gleb Natapov submitted a patch changing the way paging is handled in KVM-virtualized guests. Included in the patch was the collection of a couple of new statistics on page faults handled by each virtual CPU. More than one month later (virtualization does make things slower), Avi Kivity reviewed the patch; one of his suggestions was that the new statistics be exported as tracepoints rather than as counters.
Nobody questioned this particular bit of advice. Perhaps that's because virtualization seems boring to a lot of developers. But it is also indicative of a wider trend.
That trend is, of course, the migration of much kernel data collection and processing to the "perf events" subsystem. It has only been one year since perf showed up in a released kernel, but it has seen sustained development and growth since then. Some developers have been known to suggest that, eventually, the core kernel will be an obscure bit of code that must be kept around in order to make perf run.
Moving statistics collection to tracepoints brings some obvious advantages. If nobody is paying attention to the statistics, no data is collected and the overhead is nearly zero. When individual events can be captured, their correlation with other events can be investigated, timing can be analyzed, associated data can be captured, etc. So it makes some sense to export the actual events instead of boiling them down to a small set of numbers.
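As a concrete illustration, a tracepoint is normally declared with the TRACE_EVENT() macro in a trace header; the sketch below uses a hypothetical event name and fields (it is not taken from Gleb's patch) to show roughly what exporting a page-fault event, rather than bumping a counter, looks like.

    /*
     * Hypothetical example only: a TRACE_EVENT() declaration of the sort
     * that could replace a per-VCPU page fault counter.  Declarations like
     * this live in a trace header (e.g. include/trace/events/*.h) which is
     * included with CREATE_TRACE_POINTS defined in exactly one .c file.
     */
    TRACE_EVENT(kvm_guest_fault,

            TP_PROTO(unsigned int vcpu_id, unsigned long gva),

            TP_ARGS(vcpu_id, gva),

            TP_STRUCT__entry(
                    __field(unsigned int, vcpu_id)
                    __field(unsigned long, gva)
            ),

            TP_fast_assign(
                    __entry->vcpu_id = vcpu_id;
                    __entry->gva = gva;
            ),

            TP_printk("vcpu %u fault at %lx", __entry->vcpu_id, __entry->gva)
    );

The fault-handling code would then call trace_kvm_guest_fault(vcpu_id, gva) at the spot where it once incremented a counter; when nobody has enabled the event, that call costs little more than a branch around the disabled code.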
The down side of using tracepoints to replace counters is that it is no longer possible to query statistics maintained over the lifetime of the system; as Matt Mackall observed over a year ago, getting that kind of cumulative number out of tracing requires collecting events from time=0.
Most often, your editor would surmise, administrators and developers are looking for changes in counters and do not need to integrate from time=0. There are times, though, when that information can be useful to have. One could come close by enabling the tracepoints of interest during the bootstrap process and continuously collecting the events, but that can be expensive, especially for high-frequency events.
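That said, a user-space tool that wants something like a lifetime counter can come reasonably close without logging every event: it can open a counting (non-sampling) perf event bound to the tracepoint early in the boot process and simply read the count when needed. The sketch below is an illustration under a few assumptions: the tracepoint ID (a placeholder value here) must be read from the event's id file under /sys/kernel/debug/tracing/events/, and opening a system-wide counter normally requires root.

    /*
     * Sketch: count occurrences of a tracepoint from user space with a
     * non-sampling perf event.  The tracepoint id below is a placeholder;
     * the real value comes from debugfs, e.g.
     * /sys/kernel/debug/tracing/events/<subsys>/<event>/id
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
            /* No glibc wrapper exists, so invoke the system call directly. */
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
            struct perf_event_attr attr;
            uint64_t count;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_TRACEPOINT;
            attr.config = 1234;     /* placeholder tracepoint id from debugfs */

            /* Count the event on CPU 0 for all processes; counting (rather
             * than sampling) keeps the per-event overhead low. */
            fd = perf_event_open(&attr, -1, 0, -1, 0);
            if (fd < 0) {
                    perror("perf_event_open");
                    return 1;
            }

            sleep(10);
            if (read(fd, &count, sizeof(count)) == sizeof(count))
                    printf("events seen in ten seconds: %llu\n",
                           (unsigned long long)count);
            close(fd);
            return 0;
    }

To cover the whole system one such counter per CPU would be needed (and their values summed); the point is only that a running count can be maintained without recording each individual event.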
There is another important issue which has been raised in the past and which has never really been resolved. Tracepoints are generally seen as debugging aids used mainly by kernel developers. They are often tied into low-level kernel implementation details; changes to the code can often force changes to nearby tracepoints, or make them entirely obsolete. Tracepoints, in other words, are likely to be nearly as volatile as the kernel that they are instrumenting. The kernel changes rapidly, so it stands to reason that the tracepoints would change rapidly as well.
Needless to say, changing tracepoints will create problems for any user-space utilities which make use of those tracepoints. Thus far, kernel developers have not encouraged widespread use of tracepoints; the kernel still does not have that many of them, and, as noted above, they are mainly debugging tools. If tracepoints are made into a replacement for kernel statistics, though, then the number of user-space tools using tracepoints can only increase. And that will lead to resistance to patches which change those tracepoints and break the tools.
In other words, tracepoints are becoming part of the user-space ABI. Despite the fact that concerns about the ABI status of tracepoints have been raised in the past, this change seems to be coming in through the back door with no real planning. As Linus has pointed out in the past, the fact that nobody has designated tracepoints as part of the official ABI or documented them does not really change things. Once an interface has been exposed to user space and come into wider use, it's part of the ABI regardless of the developers' intentions. If user-space tools use tracepoints, kernel developers will have to support those tracepoints indefinitely into the future.
Past discussions have included suggestions for ways to mark tracepoints which are intended to be stable, but no conclusions have resulted. So the situation remains murky. It may well be that things will stay that way until some future kernel change breaks somebody's tools. Then the kernel community will be forced to choose between restoring compatibility for the broken tracepoints or overtly changing its longstanding promise not to break the user-space ABI (too often). It might be better to figure things out before they get to that point.
Posted Aug 26, 2010 5:31 UTC (Thu) by thedevil (guest, #32913)
This reminds me of global inaction in the face of climate change. Am I obsessed?
Posted Aug 26, 2010 13:13 UTC (Thu) by rvfh (guest, #31018)
Posted Aug 26, 2010 16:13 UTC (Thu) by marineam (guest, #28387)
Both methods of gathering stats on things in the kernel are very useful and serve different needs to different people. Not everyone has the privilege of thinking like a kernel developer all the time. :-)
Posted Aug 26, 2010 20:03 UTC (Thu) by dmk (guest, #50141)
Or the other way around: any statistics API should probably be traceable... :)
Obligatory systemtap reference
Posted Aug 30, 2010 20:07 UTC (Mon) by sfink (guest, #6405)
I haven't used either perf or systemtap enough for my opinion to be relevant, but it really seems to me like the perf people are focused on a narrow audience that does not happen to include anyone who lives in userspace. Systemtap people *are* actively concerned with sysadmins, userspace developers, etc., and are working on the large and important set of user problems such as the API/ABI one described in this article. But stap's users and developers are getting scared off by the vague but generally negative attitude towards the project by the kernel developers.
Isn't it time for the perf community to come out and directly identify what they dislike about the systemtap approach, and state their plans for "the right way" to overcome the problems that systemtap is addressing?
There's obviously a fundamental difference between "log everything and analyze it afterward" vs "run analysis code online, possibly modifying what gets traced at runtime, and report only on digested results". Is that all it is? They're mutually compatible, and as a user I've had uses for both on different problems.
To be sure, the systemtap community could do a much better job of giving examples of problems that required their approach -- but why should they go to the effort of describing those if they're just going to be ignored anyway?
(My example: I needed to identify the source of a periodic 10ms latency in between invocations of my realtime-scheduled thread. I wrote a systemtap script to record the end time of my thread's wakeup, subtract that from the start time of the next wakeup, and if that was <3ms I would throw out the various traces I had logged in between. If it was greater, I'd remember those traces plus grab some more expensive stuff (stack traces). Numbers are from memory and guaranteed to be wrong.)
Posted Sep 20, 2010 3:51 UTC (Mon) by mfedyk (guest, #55303)
it's all about peace of mind when dealing with production systems.