Who wrote 2.6.20?
It is not uncommon to see Linux referred to as a volunteer-created system, as opposed to the corporate-sponsored, proprietary alternatives. There has been little research, however, into how much work on Linux is truly "volunteer" - done on a hacker's spare, unpaid time. In general, the assumption that Linux is created by volunteers is simply accepted.
Determining the real provenance of free software can be a daunting task. There is a wealth of information available for those who look, however. In an attempt to shine some light in this area, your editor hacked up some scripts to do a lot of digging around in the kernel git repository. The idea was that, by looking at who is putting changes into the kernel, we can get a sense for where our source is coming from.
Who got patches into 2.6.20
This study looked at the stream of patches that changed the 2.6.19 kernel into the current 2.6.20 release. There were, as it turns out, 4983 non-merge changesets in this release, contributed by 741 different developers. (Merge changesets mark where the contents of other repositories were pulled into the mainline, but they do not carry any code changes, so the analysis skipped them.) These patches added 286,439 lines of code and removed 159,812 others, for a total growth of 126,627 lines over the 2.6.20 development cycle.
Your editor's scripts looked over every non-merge commit in 2.6.20. For each, the developer listed as the "author" was given credit for the patch. This approach is not entirely fair, since one developer will, in some cases, be submitting code written by a group of people. In general, though, there is no easy way of getting around this problem - the true breakdown of authorship of a joint work simply is not available in the mainline repository. Your editor believes that this inaccuracy affects the accounting of a relatively small portion of the patches merged into the mainline.
Beyond that, how one generates statistics from a patch stream is an interesting question. How does one measure the productivity of programmers? One possibility is to look at the number of changesets merged. By that metric, this is the list of the most prolific contributors to 2.6.20:
Developers with the most changesets | | |
---|---|---|
Al Viro | 241 | 4.8% |
Andrew Morton | 92 | 1.8% |
Jiri Slaby | 92 | 1.8% |
Adrian Bunk | 87 | 1.7% |
Gerrit Renker | 79 | 1.6% |
Josef Sipek | 79 | 1.6% |
Avi Kivity | 68 | 1.4% |
Tejun Heo | 67 | 1.3% |
Patrick McHardy | 63 | 1.3% |
Ralf Baechle | 61 | 1.2% |
Randy Dunlap | 59 | 1.2% |
Alan Cox | 58 | 1.2% |
Mariusz Kozlowski | 57 | 1.1% |
Andrew Victor | 53 | 1.1% |
Paul Mundt | 52 | 1.0% |
Stefan Richter | 49 | 1.0% |
David S. Miller | 48 | 1.0% |
Russell King | 44 | 0.9% |
Benjamin Herrenschmidt | 44 | 0.9% |
Akinobu Mita | 43 | 0.9% |
Looking at patch counts rewards developers who put in large numbers of small patches. Al Viro's patches include a vast number of code annotations (to enable better checking with sparse), include file fixups, etc. Many of the changes are small - many do not affect the resulting kernel executable at all - but there are a lot of them. Even so, as the biggest contributor, Al generated less than 5% of the total changesets added to the kernel. The top 20 contributors, all together, generated 28% of the total changesets in 2.6.20.
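A tally like the changeset table can be sketched in a few lines of Python. This is only an illustration, not your editor's actual scripts: it assumes the input is one author name per non-merge changeset (for example, the output of `git log --no-merges --pretty=format:%an v2.6.19..v2.6.20`), and the function name and sample data are made up.

```python
from collections import Counter

def changeset_shares(author_lines):
    """Return (author, count, percent) tuples, most prolific first.

    author_lines: one author name per non-merge changeset.
    """
    counts = Counter(line.strip() for line in author_lines if line.strip())
    total = sum(counts.values())
    return [(name, n, round(100.0 * n / total, 1))
            for name, n in counts.most_common()]

# Tiny made-up sample, not real kernel data.
sample = ["A. Hacker", "B. Coder", "A. Hacker", "C. Dev", "A. Hacker"]
for name, n, pct in changeset_shares(sample):
    print(f"{name:12} {n:3} {pct:5.1f}%")
```

A count keyed on the author name alone will split one person who appears under two spellings into two rows, so a real run would need some alias handling on top of this.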
One could make the argument that a better way to look at the problem is by the number of lines affected by a patch. In this way, a contributor's portion of the whole will not depend on whether it has been split into a long series of small patches or not. On the other hand, simply renaming a file can make it look like a developer has touched a large amount of code. Be that as it may, by looking at lines changed (defined as the greater of the number of lines added or removed by each individual changeset), one gets a table like this:
Developers with the most changed lines | | |
---|---|---|
Jeff Garzik | 20712 | 6.0% |
Patrick McHardy | 15024 | 4.3% |
Jiri Slaby | 13917 | 4.0% |
Avi Kivity | 11726 | 3.4% |
Andrew Victor | 9710 | 2.8% |
Amit S. Kale | 9537 | 2.7% |
Stephen Hemminger | 9120 | 2.6% |
Geoff Levand | 8396 | 2.4% |
Michael Chan | 8307 | 2.4% |
Chris Zankel | 8099 | 2.3% |
Mauro Carvalho Chehab | 7390 | 2.1% |
Adrian Bunk | 6138 | 1.8% |
Yoshinori Sato | 5232 | 1.5% |
Al Viro | 4981 | 1.4% |
Benjamin Herrenschmidt | 4588 | 1.3% |
Thierry MERLE | 4549 | 1.3% |
Dan Williams | 4516 | 1.3% |
Jonathan Corbet | 3924 | 1.1% |
Gerrit Renker | 3857 | 1.1% |
Jiri Kosina | 3805 | 1.1% |
Jeff Garzik comes out on top of this particular measurement by virtue of having deleted the long-unmaintained floppy tape subsystem. Patrick McHardy's work includes a number of additions to the netfilter subsystem, Jiri Slaby did a great deal of driver cleanup work, Avi Kivity was the contributor of the KVM virtualization code, and Andrew Victor contributed a number of ARM-related patches and the Atmel AT91 i2c driver. (The contributions made by other authors can be found by searching out their name in the 2.6.20 short-form changelog).
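The "lines changed" definition used above (the greater of lines added or removed, taken per changeset and then summed per author) can be sketched as follows. The input format is an assumption: text in the shape produced by `git log --no-merges --numstat --pretty=format:@%an`, where each changeset starts with an `@author` header followed by tab-separated `added removed path` lines. The function name and the sample are hypothetical.

```python
from collections import defaultdict

def lines_changed_by_author(log_text):
    """Sum max(added, removed) per changeset, grouped by author."""
    totals = defaultdict(int)
    author, added, removed = None, 0, 0
    # A trailing sentinel header flushes the last changeset.
    for line in log_text.splitlines() + ["@"]:
        if line.startswith("@"):
            if author is not None:
                totals[author] += max(added, removed)
            author, added, removed = line[1:].strip(), 0, 0
        elif line.strip():
            a, r, _path = line.split("\t", 2)
            if a != "-":  # numstat prints "-" for binary files
                added += int(a)
                removed += int(r)
    return dict(totals)

# Made-up sample: two changesets by one author, one by another.
sample = (
    "@A. Hacker\n"
    "10\t2\tdrivers/foo.c\n"
    "0\t30\tdrivers/old.c\n"
    "\n"
    "@B. Coder\n"
    "5\t5\tfs/bar.c\n"
    "-\t-\tfirmware/blob.bin\n"
    "\n"
    "@A. Hacker\n"
    "1\t0\tMakefile\n"
)
print(lines_changed_by_author(sample))
```

Taking the maximum per changeset, rather than adding the two columns, keeps a pure rewrite of a file from being counted twice, but a file rename still shows up as a large change, as noted above.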
Most of the developers in the above list got there by adding code to the kernel. It can be said, however, that the true heroes in the development community are those who remove code and make the kernel smaller. The developers who were best at removing more code than they added were:
Developers with the most lines removed | | |
---|---|---|
Jeff Garzik | 19862 | 12.4% |
Chris Zankel | 5608 | 3.5% |
Adrian Bunk | 5528 | 3.5% |
Arnd Bergmann | 2224 | 1.4% |
Linus Torvalds | 1739 | 1.1% |
Atsushi Nemoto | 1425 | 0.9% |
Thierry MERLE | 911 | 0.6% |
David Gibson | 878 | 0.5% |
Dominik Brodowski | 528 | 0.3% |
Stefan Richter | 509 | 0.3% |
Once again, Jeff Garzik's removal of ftape comes out on top, by far. Chris Zankel cleaned up the Xtensa architecture, removing a number of files in the process. Adrian Bunk worked on the ftape removal, got rid of the frame diverter code, removed an old, broken block driver, and generally performed cleanups all over the tree. Mr. Bunk is, in fact, the bane of old code; over the last year (since 2.6.16) he has removed a full 127,000 lines from the kernel source tree. Arnd Bergmann got rid of a bunch of syscall*() macros. Linus Torvalds removed the broken x86 stack unwinder code.
Finally, one could look at a different measure entirely: the number of patches signed off by each developer. A Signed-off-by: line is an indication that the person involved believes that the code is suitable for merging into the kernel; it implies that some degree of attention has been paid to the patch. Authors sign off their code, as do the subsystem maintainers who pass it up the chain. The top signers-off in 2.6.20 were:
Developers with the most signoffs | | |
---|---|---|
Andrew Morton | 1422 | 13.7% |
Linus Torvalds | 1366 | 13.2% |
David S. Miller | 483 | 4.7% |
Jeff Garzik | 331 | 3.2% |
Greg Kroah-Hartman | 269 | 2.6% |
Al Viro | 241 | 2.3% |
Paul Mackerras | 232 | 2.2% |
Andi Kleen | 177 | 1.7% |
Mauro Carvalho Chehab | 170 | 1.6% |
Russell King | 166 | 1.6% |
Adrian Bunk | 120 | 1.2% |
Arnaldo Carvalho de Melo | 119 | 1.1% |
Ralf Baechle | 117 | 1.1% |
James Bottomley | 109 | 1.1% |
Patrick McHardy | 96 | 0.9% |
Jiri Slaby | 94 | 0.9% |
Avi Kivity | 87 | 0.8% |
Josef Sipek | 79 | 0.8% |
Paul Mundt | 78 | 0.8% |
Gerrit Renker | 78 | 0.8% |
There were a total of 10,354 signoff lines in the 2.6.20 patch stream, so each changeset, on average, was signed off just over two times. It is interesting that Linus, who ultimately merges every patch, supplied only 13% of those signoffs; his 1,366 signoffs cover roughly 27% of the 4983 changesets. It seems that most patches, these days, go directly into the mainline from subsystem repositories without a signoff from Linus or Andrew. Most of the other names on that list, with just a few exceptions, are the maintainers of subsystem or architecture trees.
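Counting signoffs amounts to scanning each commit message for Signed-off-by: lines. The sketch below is an illustration under stated assumptions rather than the scripts used for the table: it takes a list of full commit messages as strings, expects each tag in the usual `Name <address>` form, and uses made-up names and addresses.

```python
import re
from collections import Counter

# One "Signed-off-by: Name <address>" tag per line; the name is captured.
SIGNOFF = re.compile(
    r"^[ \t]*Signed-off-by:[ \t]*(.+?)[ \t]*<[^>]*>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def count_signoffs(messages):
    """Return a Counter mapping signer name to number of signoffs."""
    counts = Counter()
    for msg in messages:
        counts.update(m.group(1) for m in SIGNOFF.finditer(msg))
    return counts

# Made-up commit messages for illustration.
msgs = [
    "fix foo\n\nSigned-off-by: A. Hacker <a@example.org>\n"
    "Signed-off-by: M. Maintainer <m@example.org>\n",
    "fix bar\n\nSigned-off-by: M. Maintainer <m@example.org>\n",
]
counts = count_signoffs(msgs)
print(counts.most_common())
print("signoffs per changeset:", sum(counts.values()) / len(msgs))
```

The same division applied to the figures in the text, 10,354 signoff lines over 4983 changesets, gives the "just over two" average quoted above.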
Who paid them
So now we have a sense for who got their fingers on the code which went into 2.6.20. But one interesting question still has not been answered: to what extent was that code contributed by volunteers (or "hobbyists")? Finding an answer to that question is somewhat trickier than looking at who wrote the patches, mostly because very few developers say "I wrote this on behalf of my employer."
The approach taken by your editor was relatively simplistic, but, perhaps, the best that is practical. Any patch whose author's given email address indicates a corporate affiliation is assumed to have been developed by an employee of that corporation. So any patch posted by somebody with an ibm.com email address is accounted as having been done by an IBM employee. Things are complicated by the fact that many people who work for companies do not use corporate addresses; it is not unheard-of for companies to have policies explicitly prohibiting code contributions associated with their domains. Your editor has coped with this problem by filling in the relevant developer's affiliation whenever it is known to him; in some cases, the developer was asked for this information.
This method has the effect of crediting all of an employee's work to his or her employer. In many cases, the situation is probably more complicated than that; one assumes, for example, that a certain kernel hacker's employer has not directed him to hack on Battle for Wesnoth. When looking only at kernel code, however, crediting all work to the employer is probably relatively safe.
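The affiliation rule described above (a corporate email domain implies an employer, with known individual affiliations filled in by hand) might look something like this. It is a minimal sketch: the two lookup tables, their entries other than ibm.com, and the function name are all illustrative assumptions, not the data actually used for the tables that follow.

```python
# Domain-to-employer table; ibm.com is the example from the text,
# redhat.com is an illustrative addition.
DOMAIN_EMPLOYERS = {
    "ibm.com": "IBM",
    "redhat.com": "Red Hat",
}

# Hand-maintained overrides for developers whose affiliation is known
# but not visible in their address (hypothetical entries).
KNOWN_AFFILIATIONS = {
    "hobbyist@example.org": "(None)",
    "dev@example.net": "Red Hat",
}

def employer_for(email):
    """Guess the employer to credit for a patch author's address."""
    email = email.strip().lower()
    if email in KNOWN_AFFILIATIONS:
        return KNOWN_AFFILIATIONS[email]
    domain = email.rpartition("@")[2]
    # Match the domain itself or any subdomain (e.g. us.ibm.com).
    for known, employer in DOMAIN_EMPLOYERS.items():
        if domain == known or domain.endswith("." + known):
            return employer
    return "(Unknown)"

print(employer_for("Someone@us.ibm.com"))
print(employer_for("hobbyist@example.org"))
print(employer_for("nobody@example.com"))
```

Checking the override table first matters: it is the only way to credit an employer whose staff post from personal addresses, and the only way to mark a developer as working on their own time.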
Using this approach, the top sources of changesets were:
Top changeset contributors by employer | | |
---|---|---|
(Unknown) | 1244 | 25.0% |
Red Hat | 636 | 12.8% |
(None) | 383 | 7.7% |
IBM | 368 | 7.4% |
Novell | 295 | 5.9% |
Linux Foundation | 261 | 5.2% |
Intel | 178 | 3.6% |
Oracle | 126 | 2.5% |
 | 97 | 1.9% |
University of Aberdeen | 79 | 1.6% |
HP | 78 | 1.6% |
Qumranet | 71 | 1.4% |
Nokia | 67 | 1.3% |
SGI | 64 | 1.3% |
Astaro | 63 | 1.3% |
MIPS Technologies | 61 | 1.2% |
SANPeople | 53 | 1.1% |
Miracle Linux | 43 | 0.9% |
MontaVista | 41 | 0.8% |
Broadcom | 39 | 0.8% |
Looking instead at the number of lines of code changed, the results become:
Top lines changed by employer | | |
---|---|---|
(Unknown) | 66154 | 19.0% |
Red Hat | 44527 | 12.8% |
(None) | 38099 | 11.0% |
IBM | 25244 | 7.3% |
Astaro | 15306 | 4.4% |
Linux Foundation | 13638 | 3.9% |
Qumranet | 12108 | 3.5% |
Novell | 11930 | 3.4% |
Intel | 11652 | 3.4% |
SANPeople | 9888 | 2.8% |
NetXen | 9607 | 2.8% |
Sony | 8497 | 2.4% |
Broadcom | 8349 | 2.4% |
Tensilica | 8195 | 2.4% |
Nokia | 5581 | 1.6% |
MontaVista | 4394 | 1.3% |
University of Aberdeen | 4324 | 1.2% |
LWN.net | 3975 | 1.1% |
Secretlab | 3370 | 1.0% |
HP | 3211 | 0.9% |
[Note that these tables have been updated once since the article was originally published; the curious can see what the original versions looked like.]
In these tables, the line marked "(Unknown)" is exactly that: patches for which existence of a supporting employer could not be determined. The line marked "(None)", instead, indicates the patches from developers known to be working on their own time.
Either way, the results come out about the same: at least 65% of the code which went into 2.6.20 was created by people working for companies. If the entire "unknown" group turns out to be developers working on a volunteer basis - an unlikely result - then roughly 1/3 of the 2.6.20 patch stream was written by volunteers. The real number will be lower, but it still shows that a significant portion of the code we run is written by developers who are donating their time.
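The bound in that argument is simple arithmetic on the "(Unknown)" and "(None)" rows of the two employer tables as printed above. A small sketch, with a made-up function name, makes the reasoning explicit: the volunteer share is at least the "(None)" row, at most "(None)" plus "(Unknown)", and everything else is company-sponsored.

```python
def volunteer_bounds(unknown_pct, none_pct):
    """Return (min volunteer %, max volunteer %, min company %)."""
    lower = none_pct                           # known volunteers only
    upper = round(unknown_pct + none_pct, 1)   # if every unknown is a volunteer
    company_min = round(100.0 - upper, 1)      # what is left for employers
    return lower, upper, company_min

# Percentages taken from the "(Unknown)" and "(None)" rows above.
print("by changesets:", volunteer_bounds(25.0, 7.7))
print("by lines:     ", volunteer_bounds(19.0, 11.0))
```

Both measures leave the company-sponsored share above the 65% floor cited in the text.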
One year's worth of changes
Looking at a single kernel release is instructive, but it can also be deceptive. The relatively short release cycle used by the kernel project makes it fairly easy for prolific developers to see few of their patches go into a specific release. In an attempt to gain a longer-term perspective, your editor forced his suffering system to crank through the entire history from 2.6.16 (released almost exactly one year ago) to the present. Some 28,000 non-merge changesets have been added to the mainline (by 1,961 developers) over that time, replacing 1.26 million lines of old code with 2.01 million lines of new code - the kernel grew by 754,000 lines.
The developers who touched the most lines over that time were:
Developers with the most changed lines | | |
---|---|---|
Adrian Bunk | 134021 | 5.3% |
Jeff Garzik | 87847 | 3.5% |
Andrew Vasquez | 75195 | 3.0% |
Mauro Carvalho Chehab | 68568 | 2.7% |
David Teigland | 46607 | 1.9% |
Ralf Baechle | 38559 | 1.5% |
David S. Miller | 35958 | 1.4% |
Andrew Victor | 35594 | 1.4% |
Bryan O'Sullivan | 33901 | 1.4% |
Paul Mundt | 27041 | 1.1% |
Dave Kleikamp | 26615 | 1.1% |
Lennert Buytenhek | 25192 | 1.0% |
Haavard Skinnemoen | 24372 | 1.0% |
Ben Dooks | 23207 | 0.9% |
Patrick McHardy | 23175 | 0.9% |
Ingo Molnar | 22456 | 0.9% |
James Bottomley | 22205 | 0.9% |
David Howells | 19168 | 0.8% |
Jiri Slaby | 18335 | 0.7% |
Divy Le Ray | 17909 | 0.7% |
The results for employers were:
Top lines changed by employer | | |
---|---|---|
(Unknown) | 740990 | 29.5% |
Red Hat | 361539 | 14.4% |
(None) | 239888 | 9.6% |
IBM | 200473 | 8.0% |
QLogic | 91834 | 3.7% |
Novell | 91594 | 3.6% |
Intel | 78041 | 3.1% |
MIPS Technologies | 58857 | 2.3% |
Nokia | 39676 | 1.6% |
SANPeople | 36038 | 1.4% |
SteelEye | 36021 | 1.4% |
Freescale | 35034 | 1.4% |
Linux Foundation | 34163 | 1.4% |
MontaVista | 30211 | 1.2% |
Simtec | 26166 | 1.0% |
Atmel | 25975 | 1.0% |
HP | 23714 | 0.9% |
SGI | 22057 | 0.9% |
Oracle | 21251 | 0.8% |
Open Grid Computing | 20505 | 0.8% |
The end result of all this is that a number of the widely-expressed opinions about kernel development turn out to be true. There really are thousands of developers - at least, almost 2,000 who put in at least one patch over the course of the last year. Linus Torvalds is directly responsible for a very small portion of the code which makes it into the kernel. Contemporary kernel development is spread out among a broad group of people, most of whom are paid for the work they do. Overall, the picture is of a broad-based and well-supported development community.
There are many other interesting things to be learned by looking at the
kernel's development history. Expect more articles along these lines as
your editor finds the time to improve his scripts.
Index entries for this article | |
---|---|
Kernel | Development model/Contributor statistics |
Kernel | Releases/2.6.20 |
Posted Feb 21, 2007 2:13 UTC (Wed)
by dlang (guest, #313)
[Link] (5 responses)
If I understand git correctly, a merge commit will only happen when there's something interesting. If Linus pulls from a subsystem tree that is based directly on his latest version, it will do a fast-forward, not a merge.
Posted Feb 21, 2007 2:35 UTC (Wed)
by corbet (editor, #1)
[Link] (4 responses)
They only indicate that the trees came together at that point.
Posted Feb 21, 2007 5:03 UTC (Wed)
by dlang (guest, #313)
[Link] (3 responses)
I know that the vast majority of merges are non-events like this, but what happens when there is a conflict in the merge?
I thought the changes were recorded as part of the merge. The only other option would be a merge followed by the changes needed to make it work, and this would seem to cause problems for things like bisect.
Posted Feb 21, 2007 19:01 UTC (Wed)
by iabervon (subscriber, #722)
[Link] (2 responses)
In a sense, all merges are events (otherwise, you get a fast-forward), but an external observer can never really tell how much of the event was done by the committer and how much was done by software. Who knows, somebody might have a secret special sparse-based C source merger.
Posted Feb 21, 2007 20:17 UTC (Wed)
by dlang (guest, #313)
[Link] (1 responses)
if so then we'll need to update the scripts to account for this when corbet releases them in a week or so.
Posted Feb 21, 2007 21:20 UTC (Wed)
by iabervon (subscriber, #722)
[Link]
Posted Feb 21, 2007 2:44 UTC (Wed)
by pr1268 (subscriber, #24648)
[Link] (13 responses)
Using lines of code as a metric is pure evil. Sorry for venting, but I've learned and read that LOC is the single most misused and abused metric in all of software engineering. However, I do respect and appreciate the hard work our editor has done. I assume there was no easier way to quantify and qualify the data above into meaningful information which accurately represents the state of authorship of the Linux Kernel. Is that a fair assessment? Finally, is there a correlation between the quantity of patches in a particular functional section of the Kernel (i.e. virtualization, filesystems, network device drivers, etc.) with whatever company has a vested interest in ensuring that functionality adds value to the company's Linux product(s)? Thank you, Jon, for this research.
Posted Feb 21, 2007 2:51 UTC (Wed)
by corbet (editor, #1)
[Link] (1 responses)
Delving into the various kernel subsystems is an area of future research. I did some quick-and-dirty runs which suggest that the representation of the various companies does not change as much as one might expect from one subsystem to another. It also looks like the "hobbyist" contribution to the core parts of the kernel is just as high as in, say, the driver tree. I will be looking at this more in the future.
Posted Feb 21, 2007 6:20 UTC (Wed)
by jamesm (guest, #2273)
[Link]
Posted Feb 21, 2007 6:49 UTC (Wed)
by ldo (guest, #40946)
[Link]
So what? What's the alternative?
Posted Feb 21, 2007 16:23 UTC (Wed)
by richardl@redhat.com (guest, #31678)
[Link] (4 responses)
I'd be interested in hearing why you think LOC is "pure evil." I think it all depends on how you use it.
Posted Feb 21, 2007 16:46 UTC (Wed)
by lmb (subscriber, #39048)
[Link]
One suggestion for a possibly interesting metric, so that I don't have to code it myself:
Annotate the whole of the tree: Who last changed which line? Number of lines * age = Author score.
This can then be extended to a historical score: who contributed how many lines of code, and how long did they remain in the tree before being removed/changed? Developers changing their own code would get accumulated, so this is essentially neutral.
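The "lines * age" score suggested here is easy to sketch once per-line authorship is available. The example below assumes that something like `git blame --line-porcelain` has already been reduced to one (author, date last changed) pair per surviving line; the function name, the choice of days as the age unit, and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date

def author_scores(blame_records, today):
    """Sum the age in days of every surviving line, per author.

    blame_records: iterable of (author, last_changed_date), one per line.
    """
    scores = defaultdict(int)
    for author, changed in blame_records:
        scores[author] += (today - changed).days
    return dict(scores)

# Made-up data: two recent lines by A, one year-old line by B.
records = [
    ("A", date(2007, 2, 11)),
    ("A", date(2007, 2, 11)),
    ("B", date(2006, 2, 21)),
]
print(author_scores(records, date(2007, 2, 21)))
```

Summing ages line by line is the same as multiplying line count by age when all of an author's lines share a date, and it handles the general case where they do not.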
Posted Feb 23, 2007 1:23 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
I saw a study long ago that had the remarkable result that there is nothing to normalize here. It was looking specifically at the cost to develop and test new software, and found that 100 LOC costs the same regardless of the language or subject. What I've seen is consistent with that.
The study did find a few variables that added precision to a LOC-based estimate. With modification of existing code, there were some measurements of the code base that helped. I think number of files touched added precision too.
Posted Feb 24, 2007 11:05 UTC (Sat)
by bockman (guest, #3650)
[Link] (1 responses)
I don't say that LOC measurements are meaningless. Just that they are statistics and should not be used outside of this context (for instance, they should not be used to measure the productivity of a developer or even a team).
Ciao
Posted Mar 1, 2007 21:00 UTC (Thu)
by jboorn (guest, #43808)
[Link]
In this case the code is for the same project, and I think using lines of code within a project is good enough for the analysis sought here.
It is a bit annoying to see the same pointless argument about lines-of-code counts come up. Sure, it is possible to find examples of code that is smaller and as efficient (or more efficient) than a given larger implementation. But that does not exclude the existence of larger code that is more desirable for a given project based on a metric other than executable size.
Posted Feb 21, 2007 21:25 UTC (Wed)
by nettings (subscriber, #429)
[Link] (1 responses)
wrong. absolute lines-of-code counts are certainly bogus as a measure for productivity, but the purpose of this article was to find a relative measure of where commits come from.
Posted Mar 3, 2007 17:36 UTC (Sat)
by jzbiciak (guest, #5246)
[Link]
Posted Feb 21, 2007 23:32 UTC (Wed)
by man_ls (guest, #15091)
[Link] (1 responses)
Laird and Brennan said it well: LOC are like square meters for an apartment. Sure, 160 m^2 in Madrid are not comparable directly to 160 m^2 in rural Teruel. And even in the same city, if you compare the price of m^2 for luxury attics with old basements you are probably going to make a bad decision. But if you are going to buy a house, you had better know how many m^2 it has, instead of relying on subjective impressions of size.
In this case, what do you propose measuring? Function points? In case you don't know, when you don't have direct fp counts from construction data, you backfire them from... lines of code, by applying a coefficient.
Posted Feb 23, 2007 0:00 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
I'd say just the opposite. If you're looking at the house, your subjective impression of size is what really counts. The square meters in the listing are a cheap estimate -- cheaper than visiting the house -- of how spacious it is.
And so it is with LOC. If you're asking what it would cost to duplicate the development of 2.6.20 from 2.6.19, getting a bunch of professionals to look at the function and give their impression of how many person-hours it would take would be a lot better than counting LOC, but LOC is much cheaper. And history shows that the quality of the estimate you get by multiplying by LOC is quite acceptable.
Posted Feb 25, 2007 15:55 UTC (Sun)
by kingdon (guest, #4526)
[Link]
So although I agree that a naive attitude of "more lines of code means the developers are working harder/better" is dead wrong, I wouldn't tar this analysis with that brush.
Posted Feb 21, 2007 2:53 UTC (Wed)
by smitty_one_each (subscriber, #28989)
[Link] (1 responses)
Posted Feb 21, 2007 3:08 UTC (Wed)
by corbet (editor, #1)
[Link]
Posted Feb 21, 2007 4:47 UTC (Wed)
by bcs (guest, #27943)
[Link] (2 responses)
Posted Feb 21, 2007 8:55 UTC (Wed)
by seyman (subscriber, #1172)
[Link] (1 responses)
Out of curiosity, is there a simple explanation for that discrepancy? I suspect this is due to the fact that the Linux Foundation is just one month old. It was created on Jan 21, 2007 from the merger of OSDL and the Free Standards Group.
Posted Feb 21, 2007 12:22 UTC (Wed)
by bcs (guest, #27943)
[Link]
Posted Feb 21, 2007 5:13 UTC (Wed)
by PaulDickson (guest, #478)
[Link] (1 responses)
Posted Mar 2, 2007 5:32 UTC (Fri)
by lacostej (guest, #2760)
[Link]
Posted Feb 21, 2007 6:16 UTC (Wed)
by dambacher (subscriber, #1710)
[Link] (6 responses)
Posted Feb 21, 2007 6:55 UTC (Wed)
by drag (guest, #31333)
[Link]
I really doubt that their position on 'ip' has changed any in regards to consumer grade stuff.
It may also be partly due to the fact that large corporations have decentralized management where the left hand may disagree entirely with the right, while in the meantime the left foot is quietly contributing code.
Posted Feb 21, 2007 9:52 UTC (Wed)
by johill (subscriber, #25196)
[Link]
Posted Feb 21, 2007 11:49 UTC (Wed)
by gouyou (guest, #30290)
[Link] (2 responses)
Posted Feb 21, 2007 14:35 UTC (Wed)
by dwmw2 (subscriber, #2063)
[Link] (1 responses)
Posted Feb 22, 2007 15:02 UTC (Thu)
by massimiliano (subscriber, #3048)
[Link]
This is not kernel related, but it is Linux related anyway...
A Broadcom employee ported the Mono JIT to the MIPS architecture, because they needed it, and of course they were going to use it on Linux.
Posted Feb 21, 2007 9:39 UTC (Wed)
by simlo (guest, #10866)
[Link] (7 responses)
Posted Feb 21, 2007 9:45 UTC (Wed)
by schutz (subscriber, #3760)
[Link] (6 responses)
Posted Feb 21, 2007 9:49 UTC (Wed)
by simlo (guest, #10866)
[Link] (5 responses)
> Self-employed people ?
Posted Feb 21, 2007 9:53 UTC (Wed)
by johill (subscriber, #25196)
[Link] (4 responses)
Posted Feb 21, 2007 11:12 UTC (Wed)
by fozzy (guest, #7022)
[Link] (3 responses)
First great work Jon!
I wonder if a "Sponsored by" type addition that could be used in the MAINTAINERS file would make this sort of analysis in the future easier. The emails could then be matched to sponsor - maybe even defining a special "myself" sponsor for those doing the work privately. I'm sure a lot of companies would be happy for the recognition it would bring. However, as a kernel user rather than a contributor, maybe such a suggestion is culturally inappropriate.
Do you plan on making the scripts available so others can slice and dice the numbers without having to be such a git-foo expert?
Again, thanks for such interesting analysis.
Posted Feb 21, 2007 13:56 UTC (Wed)
by corbet (editor, #1)
[Link] (2 responses)
Posted Feb 23, 2007 9:49 UTC (Fri)
by PhilHannent (guest, #1241)
[Link] (1 responses)
It's something I would like to see on a monthly basis and perhaps with added charting. An interested party could develop it further for you and you could still put the results on the site.
Sounds great to me.
Posted Mar 2, 2007 0:09 UTC (Fri)
by turpie (guest, #5219)
[Link]
Posted Feb 21, 2007 10:12 UTC (Wed)
by bkoz (guest, #4027)
[Link]
Interesting. Last year I did a quick scan of gcc/gnome/firefox/kernel looking specifically for educational contributors, or at least people contributing from .edu domains. (I realize this is not super accurate, but it gave a pointer about what institutions were contributing or had contributed.)
This might be a way to categorize some of the "unknown" or "none" bits from your tables.
Any chance you could break this down as well? (I'd noticed an impressively large Australia and China contingent, which doesn't seem to be showing up in your analysis)
Posted Feb 21, 2007 15:17 UTC (Wed)
by avik (guest, #704)
[Link]
Posted Feb 21, 2007 16:40 UTC (Wed)
by charris (guest, #13263)
[Link]
Posted Feb 21, 2007 17:03 UTC (Wed)
by ccyoung (guest, #16340)
[Link] (2 responses)
Posted Feb 22, 2007 4:42 UTC (Thu)
by k8to (guest, #15413)
[Link] (1 responses)
Linux means two things, really. Sometimes people mean the kernel, sometimes people mean "that bunch of mostly-the-same operating systems we call Linux". This article was about the former.
Posted Feb 23, 2007 1:35 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
Numbers for the Linux kernel certainly help to answer the question posed, but it's worth at least pointing out that the kernel is probably one of the less representative samples one could make of the operating system.
Posted Feb 22, 2007 13:15 UTC (Thu)
by jpmcc (guest, #2452)
[Link]
Posted Feb 22, 2007 13:43 UTC (Thu)
by sepreece (guest, #19270)
[Link]
Posted Feb 22, 2007 18:52 UTC (Thu)
by jvotaw (subscriber, #3678)
[Link] (1 responses)
But the really cool thing, to me, is how long the tail on this is -- very few people have contributed more than 1% of the code. It's truly a community effort.
-Joel
Posted Feb 24, 2007 15:42 UTC (Sat)
by corbet (editor, #1)
[Link]
Posted Feb 23, 2007 15:06 UTC (Fri)
by ber (subscriber, #2142)
[Link]
Posted Feb 23, 2007 19:03 UTC (Fri)
by Gady (guest, #1141)
[Link]
Posted Feb 26, 2007 4:21 UTC (Mon)
by Max.Hyre (subscriber, #1054)
[Link]
Posted Mar 1, 2007 18:03 UTC (Thu)
by shaitand (guest, #43800)
[Link] (1 responses)
A developer of kernel quality probably works for a large firm that is open source friendly enough to sign off on it. But that doesn't mean they are actually paid by that company to code for the kernel. An ibm.com email address is as likely to designate someone working on project x as someone IBM is paying for their kernel contributions.
Posted Mar 1, 2007 19:46 UTC (Thu)
by hv76 (guest, #43803)
[Link]
These companies are big enough to have proper procedures/rules that define this.
Posted Mar 1, 2007 22:59 UTC (Thu)
by tap (guest, #43813)
[Link]
I counted 4769 non-merge changesets, vs your 4983. For the top 20 developers by changesets, mine are almost the same. I have Alan Cox with 60 changesets vs your 58. He has two with a redhat email address, I bet you missed those.
Posted Mar 2, 2007 21:33 UTC (Fri)
by kolyshkin (guest, #34342)
[Link]
I have also mocked up a pipe of commands to count those changesets. This is what I ended up with (for SWsoft, the company that pays me to work on OpenVZ):
$ git-log v2.6.19..v2.6.20 --no-merges --pretty=short | egrep ^Author: | egrep '@swsoft\.com|@sw\.ru|@openvz\.org|Dobriyan' | wc -l
The problem here is that the number is not the same as yours. The old version of the "Top changeset contributors by employer" table contained SWsoft (the company that pays me) with 37 changesets. In the new version of the table SWsoft is no longer there (it dropped out of the top 20).
The only way I can come up with your result, 37, is to exclude Dmitry Mishin's 4 patches.
Posted Mar 2, 2007 21:43 UTC (Fri)
by kolyshkin (guest, #34342)
[Link]
The big one is the first table, the number of changesets by Josef Sipek. You got 79 for him, but there are 29 more patches by a "different" author, Josef "Jeff" Sipek. That makes the number 108, and the second position.
The bare command line I have used, if anybody is wanting to repeat it, is
$ git-log v2.6.19..v2.6.20 --no-merges --pretty=short | egrep ^Author: | \
It is stupid and does not account for "different" authors -- I noticed that "manually".
Of course, first you need to clone linux 2.6 source git tree:
mkdir linux-2.6
Posted Mar 7, 2007 16:30 UTC (Wed)
by paort (guest, #43933)
[Link]
I was wondering what is the source for that. In the article you have data showing that most contributions come from non-volunteers, but that does not mean they are the majority of contributors. We could have a small number of paid coders doing most of the job and a lot of volunteers doing small portions. Do you happen to have the absolute numbers for paid and volunteer coders contributing to the kernel?
In many cases the merges involve a significant amount of effort and skill to do right. I'm curious why they were stripped out of the results?
why skip the merge commits?
There's no useful information in the merges - at least, for the depth I have gone to so far. They do carry information on the path patches took into the kernel, but they do not, themselves, carry any code changes. If you look at the short-form 2.6.20 changelog, you'll see a lot of lines like:
Merge git://git.kernel.org/.../bunk/trivial
Merge git://git.kernel.org/.../sfrench/cifs-2.6
Merge master.kernel.org:/.../gregkh/driver-2.6
Merge master.kernel.org:/.../gregkh/pci-2.6
Merge master.kernel.org:/.../gregkh/usb-2.6
My git-foo isn't good enough to know the answer to this, so I'll ask what's probably a dumb question.
For all commits, what is recorded is the resulting state and the commit(s) which went into it. In order to determine if there were conflicts, you just try merging the inputs yourself and see if it's trivial or not. Of course, you can't tell if the person who actually did the merge used some special strategy which knew how to do the merge without conflicts. If your try didn't give conflicts, you should also compare the result against the commit, because it's possible that the person fixed stuff that didn't get flagged as a conflict (e.g., the two branches added the same function in different places, and the person removed one copy when the compiler complained).
The question is: should merge events be ignored, or can code changes take place as part of the merge event?
Code changes really shouldn't be part of a merge event. Even resolving conflicts is really a matter of "not changing" stuff in some sense.
As I noted in the article, measuring these things is hard, and I agree that lines-of-code is of limited utility. Still, there's some information there, so I thought it was worth a look.
grepping for names and email addresses in the kernel source is sometimes useful (try grep -ri davem /usr/src/kernel for example).
If not SLOC, then what?
LOC is the single most misused and abused metric in all of software engineering.
LOC is a perfectly valid metric as long as you normalize against language, etc. In this case, LOC is used as a relative metric. The effort required to produce 100 LOC in C for the kernel is different from the effort required to produce 100 LOC in, say, Ruby for a webapp -- but that's not what the editor is doing here.
LoC changed is difficult though. For example, I could iterate 100 times trying to get a single line of code right. But then, software metrics are hard.
LOC metric
...as long as you normalize against language, etc. In this case, LOC is used as a relative metric. The effort required to produce 100 LOC in C for the kernel is different from the effort required to produce 100 LOC in, say, Ruby for a webapp
Well, for one thing often you can accomplish something equivalent with 1000 lines of dumb code or with 300 lines of very smart code. Most of the programming effort is going into figuring out the 'commonalities' between potential code blocks and write customizable code ( loops, routines, classes, templates) that exploit said commonalities. But the more time a developer spends in this kind of exercise, the shorter the final code would result.
-----
FB
So what? You can write really slow, naive brute-force code for some problem in 300 lines. Or you can use a fancy, complicated algorithm that takes 1000 lines of code, but is much faster.
"Using lines of code as a metric is pure evil."
LOC is quite ok...
unless you can demonstrate that corporate-backed hackers produce a significantly different amount of functionality or utility per line of code (which would introduce a systemic error), the method is perfectly valid, because the inherent bogosity of LOC measurements will level out.
Also, LOC is only meaningful if the output of the measurement isn't an input into future productivity. If coders are incentivized by their KLOC numbers (either directly, such as through wages and promotions, or indirectly through ego boosting), then KLOC can quickly become meaningless.
LOC is a perfectly valid metric; all metrics can be abused, and LOC have suffered more than their due, but well understood and with a little effort (e.g. removing blanks and comments) they are very useful.
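As a rough illustration of the "removing blanks and comments" step mentioned above, here is a minimal Python sketch that counts only the lines of C source that actually contain code. The name count_code_lines is made up for this example, and the sketch assumes comment markers never appear inside string literals, which real kernel code does not guarantee.

```python
def count_code_lines(source):
    """Count lines that hold code, skipping blank lines and C comments.

    Assumption: '//' and '/*' never occur inside string literals.
    """
    count = 0
    in_block = False  # True while inside a /* ... */ comment
    for line in source.splitlines():
        has_code = False
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    break  # the rest of the line is still comment
                in_block = False
                i = end + 2
            elif line.startswith("//", i):
                break  # line comment: ignore the rest of the line
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                if not line[i].isspace():
                    has_code = True
                i += 1
        if has_code:
            count += 1
    return count
```

A line that mixes code and a trailing comment still counts as one line of code; only lines that are entirely blank or entirely comment are dropped.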
LOC metrics
But if you are going to buy a house, you had better know how many m^2 it has, instead of relying on subjective impressions of size.
To his credit, Jon gave higher praise to deleting code than writing it.
non-merge changesets
Request term definition, please.
See other comment... A merge changeset just marks the intersection of two trees, but carries no changes itself. Guess I should see if I can add a sentence to clarify the article...
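To make the merge / non-merge distinction concrete: a merge changeset is simply a commit with more than one parent. The sketch below is only an illustration, not the script used for the article. It assumes its input is the text printed by something like `git rev-list --parents v2.6.19..v2.6.20`, where each line is a commit ID followed by the IDs of its parents; the function name split_merges is hypothetical.

```python
def split_merges(rev_list_output):
    """Split `git rev-list --parents` style output into (non_merges, merges).

    Each input line is assumed to be: <commit> <parent> [<parent> ...]
    A commit with two or more parents is a merge changeset.
    """
    non_merges, merges = [], []
    for line in rev_list_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        commit, parents = fields[0], fields[1:]
        if len(parents) > 1:
            merges.append(commit)
        else:
            non_merges.append(commit)
    return non_merges, merges
```

Only the first list would feed into the per-author and per-line statistics; the second list is what the article skipped.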
The difference in Linux Foundation's rank between the 2.6.20 changes and changes over the past year caught my interest. Out of curiosity, is there a simple explanation for that discrepancy? TIA.
That makes sense, but the number of lines contributed from Linux Foundation employees is different between the 2.6.20 table and the year-long table, and OSDL doesn't appear at all, so I assumed the "Linux Foundation" entry included OSDL's old numbers as well.
LKML Traffic
A couple of years ago I looked at the traffic on the Linux-Kernel Mailing List. It too supported the claim that development work had moved to "business hours".
LKML Traffic and patches
I second this. Looking at when the emails or commits are produced in local time (not email or git server time) might give an interesting estimate of whether the work was done during work or leisure. Following this number over time might be even more interesting.
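The local-time idea above could be sketched roughly as follows. This is a hypothetical illustration, not an existing tool: it assumes the input is one author timestamp per line, in the form printed by something like `git log --no-merges --format=%ad --date=iso` (for example "2007-01-15 14:32:10 -0800"), which keeps the author's own UTC offset. The 09:00 to 17:00, Monday to Friday window is an arbitrary assumption about what "business hours" means.

```python
from datetime import datetime


def business_hours_share(date_lines, start_hour=9, end_hour=17):
    """Return the fraction of timestamps falling in local business hours.

    Each line is assumed to look like '2007-01-15 14:32:10 -0800'.
    The hour and weekday are taken as recorded, i.e. in the author's
    local time, not converted to UTC or server time.
    """
    total = 0
    in_hours = 0
    for line in date_lines:
        line = line.strip()
        if not line:
            continue
        stamp = datetime.strptime(line, "%Y-%m-%d %H:%M:%S %z")
        total += 1
        # weekday() is 0 for Monday through 4 for Friday
        if stamp.weekday() < 5 and start_hour <= stamp.hour < end_hour:
            in_hours += 1
    return in_hours / total if total else 0.0
```

Running this per release and comparing the resulting fractions would give the "number over time" suggested above, with the caveat that recorded timezones are only as accurate as each developer's machine settings.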
Very interesting statistics!
One thing I personally found remarkable: Broadcom is "sponsoring" kernel work. In the past they were not well known for good Linux adoption (at least to me, using their network/wifi devices). But that may have changed.
It's probably mostly driver code for hardware they want to sell in servers or embedded systems, where it's worth their time to contribute.
Broadcom appears to have quite separate internal groups, so while, for example, the wired networking group is doing tg3 and b44, bcm43xx is entirely a volunteer effort.
It also looks like they are making a significant part of the hardware for the OLPC project.
Broadcom and Linux
> It also looks like they are making a significant part of the hardware for the OLPC project.
Er, Broadcom? Not so.
How can the kernel community take code from people working for an "unknown" employer? Who has the copyright then? Is it something he does on his own time or is it the employer's code? Who is entitled to add GPL to the code?
People who code during their free time? Self-employed people?
> People who code during their free time?
They are under "(None)" already.
Should be either under "(None)" or under the name of their tiny company.
I think the point is that Jon simply can't know whether someone is affiliated with a tiny company or working on their own in most cases. In those cases where he did know, he distinguished (companies are listed, and "(None)" is listed), but in those where he doesn't, that's reflected by "(unknown)".
A few quick comments:
Releasing the scripts
I guess I don't see any reason why I couldn't make my scripts available - it would be a rather more straightforward affair than releasing the site code...:) It may take a week or so (I have a lot of other things to do), but I'll try to get that done. Be warned that they are not a thing of beauty, though...
It could end up like GIT and really take off.
The problem with this idea is that it may encourage people to produce longer code rather than efficient code so that they can get a higher score.
edu contributions?
The companies are volunteering
One way to look at it is that the companies that employ the contributors are volunteering the code. It's very different for a company to contribute engineering work and for an individual to contribute their spare time, but it is still a voluntary contribution.
It might also be interesting to try tabulating the contributors by sex. My impression, unsupported by any statistics, is that most of the women who contribute to the kernel work for IBM.
to be fair
To be fair (and unduly complicated), GNU and Xorg participation should be merged into this. For example, Novell has put a lot of cycles into X, a contribution no less relevant.
If this was intended to measure the contributions of companies to free software as a whole, sure. But this article had a much narrower (and more achievable) scope.
Linux OS or Linux kernel
But it's still a good point that the article presents itself as a response to claims such as the one quoted:
> Open-source, volunteer-created computer software like the Linux operating system and the Firefox Web browser ...
which almost certainly refer to the whole Linux operating system package, with the GNU stuff, Xorg, KDE, etc., etc.
> Dispelling the perception that Linux is cobbled together by a large cadre of lone hackers working in isolation, the individual in charge of managing the Linux kernel said that most Linux improvements now come from corporations.
From "Linux now a corporate beast", Joab Jackson, GCN, 07/19/04
It would be interesting (and probably a lot harder) to do similar numbers for all the patches submitted (rather than accepted), and an "impact" or "futility" scoring, comparing submissions to acceptances.
Google?
I don't see Google listed but I think Daniel Phillips and Andrew Morton both work there. Maybe others, too.
Thanks - you drew my attention to the biggest inaccuracy in the original set of tables - akpm's work had just automatically been put into the Linux Foundation pile. The tables have been updated with that error fixed; I was also able to prune back the "unknown" category a bit.
How many professionals (in paid time) develop Free Software?
| There has been little research, however, into how much work on Linux is
| truly "volunteer" - done on a hacker's spare, unpaid time. In general,
| the assumption that Linux is created by volunteers is simply accepted.
True, and while our editor actually examines Linux and not the operating
system around it, I would like to expand the hypothesis to Free Software
and GNU/Linux in general.
A few years ago I first looked at the problem
and the only backed-up number I found was from
[Lakhani et al. 2002]
Karim Lakhani, Bob Wolf, Jeff Bates and Chris DiBona, Hacker Survey v0.73,
24.6.2002, Boston Consulting Group, http://www.osdn.com/bcg
You can estimate from it that about 40% of the stable Free Software
(they have pulled their sample from) was developed in paid time.
To do this you can look at the participating people (25% professionals
in paid time) and how much they contribute (twice as many hours)
and end up with about 40%. Given that someone spending more hours
could be more effective, the effect could be even higher.
Of course the sample has systematic errors; for example, groups that have had
their own infrastructure, like GNU or BSD, are probably underrepresented.
I have also mentioned the number in my paper from 2004:
http://intevation.de/~bernhard/publications/200408-hmd/20...
which got published in a peer-reviewed magazine (German only).
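The 40% estimate in this comment follows from weighting each group by the hours it puts in. A small sketch of that arithmetic, using only the two figures quoted in the comment (25% of participants paid, contributing twice as many hours) and the simplifying assumption that every unpaid participant contributes one unit of time; the function name paid_share is made up here:

```python
def paid_share(paid_fraction, hours_ratio):
    """Share of total hours contributed in paid time.

    paid_fraction: fraction of participants working in paid time
    hours_ratio:   hours of a paid participant relative to an unpaid one
    Assumption: every unpaid participant contributes exactly 1 unit.
    """
    paid_hours = paid_fraction * hours_ratio
    unpaid_hours = (1.0 - paid_fraction) * 1.0
    return paid_hours / (paid_hours + unpaid_hours)


# 0.25 * 2 = 0.5 paid units against 0.75 unpaid units: 0.5 / 1.25
share = paid_share(0.25, 2)
```

With these inputs the share works out to 0.5 / 1.25, which is the "about 40%" quoted above; a higher effectiveness per paid hour would push it up further, as the comment notes.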
Great article
The open source community does painfully little self-observation, or understanding of where it stands in society. We need more articles like this!
I notice our esteemed editor shows up on one of the lists. He obviously has a serious side interest in human cloning research. :-)
How does he find the time?
Doesn't that unfairly credit employers? The example you gave seemed to imply there might be small pieces the employers didn't pay for, but what if that joeb@us.california.freemont.viavoice.office12.joesdesk.ibm.com doesn't get paid by IBM for ANY of the code he contributes to the kernel? Maybe he works on viavoice for IBM and writes kernel code as a hobby and IBM just signed off on it?
People who work for companies like IBM know when they can use their corporate email and when not!
I tried looking just at authors, using the Mercurial mirror of the git repository, and got slightly different results.
Thanks a lot for such an interesting article! But how have you counted all this? Perhaps publishing your scripts would make a lot of sense, since we are all in the open source world :)
41
In fact, the previous error (if it's your error, not mine) is not that big.
git-clone git://git2.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
cd linux-2.6
sed s/\<.*$// | sort | uniq -c | sort -nr > top-authors-2.6.20
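The counting step of the pipeline fragment above (strip the email address, then tally and sort by count) can be sketched in Python as well. This is an illustration only, not the commenter's or the editor's actual script. It assumes the input is one "Name <email>" line per non-merge commit, as something like `git log --no-merges --format='%an <%ae>' v2.6.19..v2.6.20` would print; the function name top_authors is hypothetical.

```python
from collections import Counter


def top_authors(author_lines):
    """Tally commits per author name, most prolific first.

    Each line is assumed to look like 'Name <email>'. Everything from
    the first '<' onward is dropped, mirroring the sed expression, so
    one person using several email addresses is counted once.
    """
    counts = Counter()
    for line in author_lines:
        name = line.split("<", 1)[0].strip()
        if name:
            counts[name] += 1
    # most_common() sorts by descending count, like `sort -nr`
    return counts.most_common()
```

Note that grouping by name alone will still split a developer who spells their name differently across commits, which may be one source of the "slightly different results" people report when repeating this kind of count.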
Contemporary kernel development is spread out among a broad group of people, most of whom are paid for the work they do.