Statistics and tracepoints
Back in July, Gleb Natapov submitted a patch changing the way paging is handled in KVM-virtualized guests. Included in the patch was the collection of a couple of new statistics on page faults handled by each virtual CPU. More than one month later (virtualization does make things slower), Avi Kivity reviewed the patch; one of his suggestions was that the new statistics be exported as tracepoints rather than as counters.
Nobody questioned this particular bit of advice. Perhaps that's because virtualization seems boring to a lot of developers. But it is also indicative of a wider trend.
That trend is, of course, the migration of much kernel data collection and processing to the "perf events" subsystem. It has only been one year since perf showed up in a released kernel, but it has seen sustained development and growth since then. Some developers have been known to suggest that, eventually, the core kernel will be an obscure bit of code that must be kept around in order to make perf run.
Moving statistics collection to tracepoints brings some obvious advantages. If nobody is paying attention to the statistics, no data is collected and the overhead is nearly zero. When individual events can be captured, their correlation with other events can be investigated, timing can be analyzed, associated data can be captured, etc. So it makes some sense to export the actual events instead of boiling them down to a small set of numbers.
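As a concrete illustration, a tracepoint is normally declared with the TRACE_EVENT() macro in a trace header; the sketch below uses a hypothetical event name and fields (it is not taken from Gleb's patch) to show roughly what exporting a page-fault event, rather than bumping a counter, looks like.

    /*
     * Hypothetical example only: a TRACE_EVENT() declaration of the sort
     * that could replace a per-VCPU page fault counter.  Declarations like
     * this live in a trace header (e.g. include/trace/events/*.h) which is
     * included with CREATE_TRACE_POINTS defined in exactly one .c file.
     */
    TRACE_EVENT(kvm_guest_fault,

            TP_PROTO(unsigned int vcpu_id, unsigned long gva),

            TP_ARGS(vcpu_id, gva),

            TP_STRUCT__entry(
                    __field(unsigned int, vcpu_id)
                    __field(unsigned long, gva)
            ),

            TP_fast_assign(
                    __entry->vcpu_id = vcpu_id;
                    __entry->gva = gva;
            ),

            TP_printk("vcpu %u fault at %lx", __entry->vcpu_id, __entry->gva)
    );

The fault-handling code would then call trace_kvm_guest_fault(vcpu_id, gva) at the spot where it once incremented a counter; when nobody has enabled the event, that call costs little more than a branch around the disabled code.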
The down side of using tracepoints to replace counters is that it is no longer possible to query statistics maintained over the lifetime of the system; as Matt Mackall observed over a year ago, getting that kind of cumulative number out of tracing requires collecting events from time=0.
Most often, your editor would surmise, administrators and developers are looking for changes in counters and do not need to integrate from time=0. There are times, though, when that information can be useful to have. One could come close by enabling the tracepoints of interest during the bootstrap process and continuously collecting the events, but that can be expensive, especially for high-frequency events.
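That said, a user-space tool that wants something like a lifetime counter can come reasonably close without logging every event: it can open a counting (non-sampling) perf event bound to the tracepoint early in the boot process and simply read the count when needed. The sketch below is an illustration under a few assumptions: the tracepoint ID (a placeholder value here) must be read from the event's id file under /sys/kernel/debug/tracing/events/, and opening a system-wide counter normally requires root.

    /*
     * Sketch: count occurrences of a tracepoint from user space with a
     * non-sampling perf event.  The tracepoint id below is a placeholder;
     * the real value comes from debugfs, e.g.
     * /sys/kernel/debug/tracing/events/<subsys>/<event>/id
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
            /* No glibc wrapper exists, so invoke the system call directly. */
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
            struct perf_event_attr attr;
            uint64_t count;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_TRACEPOINT;
            attr.config = 1234;     /* placeholder tracepoint id from debugfs */

            /* Count the event on CPU 0 for all processes; counting (rather
             * than sampling) keeps the per-event overhead low. */
            fd = perf_event_open(&attr, -1, 0, -1, 0);
            if (fd < 0) {
                    perror("perf_event_open");
                    return 1;
            }

            sleep(10);
            if (read(fd, &count, sizeof(count)) == sizeof(count))
                    printf("events seen in ten seconds: %llu\n",
                           (unsigned long long)count);
            close(fd);
            return 0;
    }

To cover the whole system one such counter per CPU would be needed (and their values summed); the point is only that a running count can be maintained without recording each individual event.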
There is another important issue which has been raised in the past and which has never really been resolved. Tracepoints are generally seen as debugging aids used mainly by kernel developers. They are often tied into low-level kernel implementation details; changes to the code can often force changes to nearby tracepoints, or make them entirely obsolete. Tracepoints, in other words, are likely to be nearly as volatile as the kernel that they are instrumenting. The kernel changes rapidly, so it stands to reason that the tracepoints would change rapidly as well.
Needless to say, changing tracepoints will create problems for any user-space utilities which make use of those tracepoints. Thus far, kernel developers have not encouraged widespread use of tracepoints; the kernel still does not have that many of them, and, as noted above, they are mainly debugging tools. If tracepoints are made into a replacement for kernel statistics, though, then the number of user-space tools using tracepoints can only increase. And that will lead to resistance to patches which change those tracepoints and break the tools.
In other words, tracepoints are becoming part of the user-space ABI. Despite the fact that concerns about the ABI status of tracepoints have been raised in the past, this change seems to be coming in through the back door with no real planning. As Linus has pointed out in the past, the fact that nobody has designated tracepoints as part of the official ABI or documented them does not really change things. Once an interface has been exposed to user space and come into wider use, it's part of the ABI regardless of the developers' intentions. If user-space tools use tracepoints, kernel developers will have to support those tracepoints indefinitely into the future.
Past discussions have included suggestions for ways to mark tracepoints which are intended to be stable, but no conclusions have resulted. So the situation remains murky. It may well be that things will stay that way until some future kernel change breaks somebody's tools. Then the kernel community will be forced to choose between restoring compatibility for the broken tracepoints or overtly changing its longstanding promise not to break the user-space ABI (too often). It might be better to figure things out before they get to that point.
Posted Aug 26, 2010 5:31 UTC (Thu) by thedevil (guest, #32913)
This reminds me of global inaction in the face of climate change. Am I obsessed?
Posted Aug 26, 2010 13:13 UTC (Thu) by rvfh (guest, #31018)
Posted Aug 26, 2010 16:13 UTC (Thu) by marineam (guest, #28387)
Both methods of gathering stats on things in the kernel are very useful and serve different needs to different people. Not everyone has the privilege of thinking like a kernel developer all the time. :-)
Posted Aug 26, 2010 20:03 UTC (Thu) by dmk (guest, #50141)
Or the other way around: any statistics API should probably be traceable... :)
Obligatory systemtap reference
Posted Aug 30, 2010 20:07 UTC (Mon) by sfink (guest, #6405)
I haven't used either perf or systemtap enough for my opinion to be relevant, but it really seems to me like the perf people are focused on a narrow audience that does not happen to include anyone who lives in userspace. Systemtap people *are* actively concerned with sysadmins, userspace developers, etc., and are working on the large and important set of user problems such as the API/ABI one described in this article. But stap's users and developers are getting scared off by the vague but generally negative attitude towards the project by the kernel developers.
Isn't it time for the perf community to come out and directly identify what they dislike about the systemtap approach, and state their plans for "the right way" to overcome the problems that systemtap is addressing?
There's obviously a fundamental difference between "log everything and analyze it afterward" vs "run analysis code online, possibly modifying what gets traced at runtime, and report only on digested results". Is that all it is? They're mutually compatible, and as a user I've had uses for both on different problems.
To be sure, the systemtap community could do a much better job of giving examples of problems that required their approach -- but why should they go to the effort of describing those if they're just going to be ignored anyway?
(My example: I needed to identify the source of a periodic 10ms latency in between invocations of my realtime-scheduled thread. I wrote a systemtap script to record the end time of my thread's wakeup, subtract that from the start time of the next wakeup, and if that was <3ms I would throw out the various traces I had logged in between. If it was greater, I'd remember those traces plus grab some more expensive stuff (stack traces). Numbers are from memory and guaranteed to be wrong.)
Posted Sep 20, 2010 3:51 UTC (Mon) by mfedyk (guest, #55303)
it's all about peace of mind when dealing with production systems.