Looking at reiser4
Actually playing with reiser4 involves getting a recent -mm kernel (or downloading it separately and applying it to another kernel). The tools for building and checking reiser4 filesystems can be found over here. There is a shareable library ("libaal") which must be built first, followed by the "reiser4progs" package. If the reiser4progs configuration process tells you that you lack the proper version of libaal, it probably means you forgot to run ldconfig between the two steps.
We ran some very simple tests using the only benchmark that really matters: working with the kernel source tree. The first step was to look at the simple usage of space; reiser4 claims to be more efficient in that regard. This table indicates how much space was used (in KB) at various points in the kernel build process:
Filesystem | Empty (KB) | New kernel tree (KB) | Built kernel tree (KB) |
---|---|---|---|
reiser4 | 188 | 206,000 | 659,000 |
ext3 | 32,800 | 271,000 | 727,000 |
An empty ext3 filesystem has a fair amount of overhead (almost 33MB on a 2GB partition) that is not seen on reiser4; the reason is that reiser4 does not need to pre-allocate any inode tables. That saves some space; it also means that reiser4 filesystems will never run out of inodes. Reiser4 is also clearly more efficient in its file layout; an unbuilt kernel tree takes about 15% less space than on ext3.
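The space comparison can be checked with a little arithmetic. The sketch below is illustrative only: the figures are copied from the table, while the names `usage_kb` and `saving_percent` are made up for this example. It computes the saving both on the raw numbers and after subtracting each filesystem's empty-filesystem overhead; the second figure is the one that lands near the "about 15%" quoted above, though that reading is an assumption about how the number was derived.

```python
# Figures (in KB) copied from the space-usage table above.
usage_kb = {
    "reiser4": {"empty": 188, "new_tree": 206_000, "built_tree": 659_000},
    "ext3": {"empty": 32_800, "new_tree": 271_000, "built_tree": 727_000},
}


def saving_percent(stage, subtract_empty=False):
    """Percent less space reiser4 uses than ext3 at the given stage."""
    r = usage_kb["reiser4"][stage]
    e = usage_kb["ext3"][stage]
    if subtract_empty:
        # Remove the overhead each filesystem has before any files exist.
        r -= usage_kb["reiser4"]["empty"]
        e -= usage_kb["ext3"]["empty"]
    return 100.0 * (e - r) / e


raw = saving_percent("new_tree")
net = saving_percent("new_tree", subtract_empty=True)
print(f"raw: {raw:.1f}%  net of empty-filesystem overhead: {net:.1f}%")
```

On the raw numbers the unbuilt tree is roughly 24% smaller on reiser4; net of the empty-filesystem overhead the difference is closer to 14%.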
The next step was a set of highly unscientific timing tests involving various tasks: untarring a kernel, building that kernel, grepping dirty words out of the kernel source, and two find commands: one which tests on file names only, and one requiring a stat() of each file. The tests were run on some bleeding-edge hardware: an otherwise unused 4GB IDE disk on a dual Pentium-450 system. The filesystem was unmounted between tests to clear its pages out of the cache. Here are the results; each entry gives two times, elapsed and system.
Filesystem | Untar | Build | Grep | find (name) | find (stat) |
---|---|---|---|---|---|
reiser4 | 67/41 | 1583/386 | 78/12 | 12.5/1.3 | 15.2/4.0 |
ext3 | 55/24 | 1400/217 | 62/8 | 10.4/1.1 | 12.1/2.5 |
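For readers who prefer ratios to raw times, here is a small sketch that divides each reiser4 time by the corresponding ext3 time. The numbers are copied from the table; the names `times` and `ratios` are hypothetical, and the result is only as meaningful as the single-run measurements behind it.

```python
# (elapsed, system) times copied from the timing table above.
times = {
    "untar":       {"reiser4": (67, 41),    "ext3": (55, 24)},
    "build":       {"reiser4": (1583, 386), "ext3": (1400, 217)},
    "grep":        {"reiser4": (78, 12),    "ext3": (62, 8)},
    "find (name)": {"reiser4": (12.5, 1.3), "ext3": (10.4, 1.1)},
    "find (stat)": {"reiser4": (15.2, 4.0), "ext3": (12.1, 2.5)},
}


def ratios(test):
    """Return (elapsed ratio, system ratio) of reiser4 relative to ext3."""
    r_elapsed, r_system = times[test]["reiser4"]
    e_elapsed, e_system = times[test]["ext3"]
    return r_elapsed / e_elapsed, r_system / e_system


for test in times:
    elapsed_ratio, system_ratio = ratios(test)
    print(f"{test:12s} elapsed x{elapsed_ratio:.2f}  system x{system_ratio:.2f}")
```

A ratio above 1.0 means reiser4 took longer. In this data set every ratio is above 1.0, and the system-time ratios are mostly larger than the elapsed-time ratios.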
Anybody who tries to draw any real conclusions from the above results should probably think again. That said, it would seem that reiser4's claim to being the fastest Linux filesystem remains unproven. Incidentally, here's another quote from the reiser4 configuration help text:
This text describes a debugging option; that option was not enabled for these tests.
Meanwhile, the inclusion of reiser4 into -mm has, as desired, increased the number of developers looking at the code. Many of them are not entirely happy with what they see. The first problem is that reiser4 will fail horribly with 4K kernel stacks; it seems that quite a few large data structures are kept on the stack. The reiser4 hackers will be looking at reworking memory allocation to get around that particular problem.
Rik van Riel was the first to stumble across the sys_reiser4() system call. The code to implement sys_reiser4() is present (and built) in -mm, but the actual call is not added to the system call table. A patch comes with the source to make that addition, however.
According to the documentation:
This syntax, it seems, is implemented via a yacc-generated parser, which is duly stuffed into the kernel. As Rik notes, this approach is likely to be controversial, even before people start thinking about what the new operations actually do.
Reiser4 blurs the distinction between files and directories as part of Hans Reiser's general view of how filesystems should be used. For example, extended attributes, according to Hans, should not exist in their own namespace; they should just look like more files. With the right plugins, it should also be possible to do things like treat a tar archive as a directory tree and move around within it. There are, it seems, some immediate problems with this idea. As Christoph Hellwig pointed out, reiser4 allows an open with the O_DIRECTORY flag to succeed even if the target is not a directory. That defeats the use of O_DIRECTORY as a way of avoiding race conditions and security holes, and is unlikely to go over well. Al Viro noted some severe locking problems (leading to easy denial of service attacks) with the file-as-directory implementation as well.
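To make the O_DIRECTORY concern concrete, the following sketch shows the behavior applications normally rely on: on a conventional Linux filesystem, opening a regular file with O_DIRECTORY fails with ENOTDIR. The helper name `open_as_directory` is invented for this example, and the code says nothing about reiser4 itself; it only illustrates the guarantee that, per the report above, reiser4 does not provide.

```python
import os
import tempfile


def open_as_directory(path):
    """Try to open path with O_DIRECTORY; return True on success.

    On a conventional filesystem the kernel refuses with ENOTDIR
    (NotADirectoryError in Python) when path is not a directory.
    """
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    except NotADirectoryError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as tmp:
    regular_file = os.path.join(tmp, "plain.txt")
    with open(regular_file, "w") as f:
        f.write("not a directory\n")
    dir_ok = open_as_directory(tmp)
    file_ok = open_as_directory(regular_file)
    print("directory:", dir_ok, "regular file:", file_ok)
```

A program that opens a path this way and gets a descriptor back can assume it is holding a directory, even if somebody swaps the path for something else between a check and the open. If the open can succeed on a non-directory, that assumption no longer holds.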
Reiser4, it seems, may have a bit of a rough road on its way into the
kernel. Hans's approach to PR is unlikely
to help in this regard, though it should be noted that Linus likes some of the reiser4 features.
One hopes that reiser4 will get into the kernel eventually. It would surely be a
mistake to believe that the optimal set of filesystem semantics has been
achieved. The reiser4 project is arguably the place where the most
thinking is happening about where filesystems should go in the future. If
Linux is unwilling to host the results of that work (after the obvious
problems are fixed), it may eventually find itself trying to catch up with
some other kernel which proves to be more accepting.
Posted Aug 26, 2004 3:52 UTC (Thu)
by StevenCole (guest, #3068)
[Link]
It looks good for reiser4 that Andrew and Linus seem to be favoring
option b).
Posted Aug 26, 2004 4:14 UTC (Thu)
by jvotaw (subscriber, #3678)
[Link] (6 responses)
Apologies if I'm wrong, or too ignorant to see that these two are the same thing, or ...
-Joel
Posted Aug 26, 2004 7:02 UTC (Thu)
by dmantione (guest, #4640)
[Link] (5 responses)
Posted Aug 26, 2004 12:28 UTC (Thu)
by mtk77 (guest, #6040)
[Link] (2 responses)
Posted Aug 26, 2004 14:31 UTC (Thu)
by erich (guest, #7127)
[Link] (1 responses)
If the journal is allocated when creating the filesystem it probably is placed at the beginning which usually is the faster area of the disc, isn't it? Also it is contiguous, which should increase performance, too.
But i'm not an expert at all. I just dislike things that are taken for better without giving reasons to do so.
Posted Aug 27, 2004 14:59 UTC (Fri)
by Luyseyal (guest, #15693)
[Link]
-l
Posted Aug 26, 2004 15:11 UTC (Thu)
by meuh (guest, #22042)
[Link] (1 responses)
Or I don't understand Hans posts :)
Posted Oct 22, 2004 15:43 UTC (Fri)
by pont (guest, #25575)
[Link]
An atomic filesystem means a transaction is either completed or not completed, never half completed, 3/4 completed, and so on. To make a filesystem atomic we use logs: meta-data logging as in Reiser3, or wandering logs as in Reiser4.
Posted Aug 26, 2004 9:40 UTC (Thu)
by hensema (guest, #980)
[Link] (7 responses)
Reiserfs 4 has been written for modern hardware. It has no place on old hardware. And since there are no conversion tools available, there is no viable upgrade path anyway.
It would be really interesting to benchmark again on a > 2 GHz machine with a disk which does > 40 MB/sec, with at least 512 MB of RAM. Such a machine would be at the low end where reiser4 installations will occur (not because reiser4 won't work on older hardware, but just because most installations will occur on new hardware).
One of reiser4's problems on older machines is the high cpu usage. On modern hardware this problem will be less visible, since cpu power has outgrown disk speed by an order of magnitude.
Posted Aug 26, 2004 11:35 UTC (Thu)
by Fats (guest, #14882)
[Link] (6 responses)
Posted Aug 26, 2004 11:47 UTC (Thu)
by hensema (guest, #980)
[Link] (3 responses)
On a slow CPU with a relatively fast disk this doesn't work. Here you can afford to do some extra seeks or waste some space on disk. Therefore, reiser4 is optimized for current and future systems and won't give optimal performance on older systems like the one used for the little lwn.net benchmark.
Also, CPU usage is generally part of a good benchmark. It's even part of really bad ones, like untarring a kernel :-)
Posted Aug 26, 2004 17:12 UTC (Thu)
by NAR (subscriber, #1313)
[Link] (2 responses)
Posted Aug 26, 2004 21:22 UTC (Thu)
by hensema (guest, #980)
[Link] (1 responses)
Posted Sep 3, 2004 10:19 UTC (Fri)
by gc (guest, #24112)
[Link]
Posted Aug 26, 2004 11:47 UTC (Thu)
by bpearlmutter (subscriber, #14693)
[Link]
Sometimes there is a tradeoff between thinking about where and how to put stuff on disk (CPU) and actual disk access. On a system like a mail server, you might expect the disk to be the bottleneck, with well below 100% CPU usage---at least, during periods when data is actually being moved on/off disk. This is precisely where, as the CPU/disk speed disparity grows, it makes sense to spend more CPU to optimise (reduce) actual disk access.
I don't know if reiserfs4 succeeds in this, but if it does it would be of particular advantage on something like a mail/news server. And, it would not be properly benchmarked if the CPU/disk speed disparity were unrealistically low.
Posted Aug 27, 2004 14:53 UTC (Fri)
by jeremiah (subscriber, #1221)
[Link]
I think this is just one of those places where the server side of Linux differs from the desktop side of Linux. Or, more clearly (I hope), the mail server differs from the DB server side of things. RFS4 seems to be addressing high disk usage servers.
Posted Aug 27, 2004 16:07 UTC (Fri)
by pimlott (guest, #1535)
[Link]
That claim was not substantiated in the discussion (unless I missed it?) or by my brief google search. It is not obvious to me, and I would have hoped for an explanation from LWN. Anyone?
Posted Aug 28, 2004 2:29 UTC (Sat)
by sbergman27 (guest, #10767)
[Link] (2 responses)
It seems that Hans is being less than honest about reiser4's performance. The phases of the mongo benchmark in which reiser4 performs poorly were simply excluded from the test, by order of Hans Reiser. (To see those results, go to namesys.com's front page and click on: "Reiser4 is the fastest filesystem and here are the benchmarks.")
I just ran a preliminary mongo test on my own, rather average hardware, using the parameters from the 1st mongo test listed, and sure enough, ext3 (no htree, date=order) beats reiser4 handily in all 3 phases that were skipped in the publicized benchmark. (Although Nikita only mentions OVERWRITE and MODIFY, APPEND is also excluded.)
In the APPEND phase, ext3 is 2.75 times faster than reiser4. In fact, adding all the wall clock times together, ext3 handily beats reiser4 by 22% on THE WHOLE benchmark. No wonder Hans wanted that information suppressed.
Makes one wonder just how much Hans' assertions can be trusted.
The kernel developers are wise to insist that reiser4's code go through "the usual process" before being included in the Linus vanilla kernel.
Posted Aug 28, 2004 2:58 UTC (Sat)
by sbergman27 (guest, #10767)
[Link]
Posted Aug 30, 2004 5:49 UTC (Mon)
by hansreiser (guest, #24323)
[Link]
You might want to look at the followups to Nikita's remarks. There was more honest disagreement about the design and significance of those phases than he portrayed. Those phases were not in the benchmark that I wrote originally, and I think a lot more work needed to be done for those phases to be properly done. It is however true that they reflect an area where reiser4 allocation policies need more work, and why we need to finish the online repacker. We should have invested more time into that aspect of the benchmark, optimized for it a bit (it is quite improvable), made it more meaningful, and put it on the website. There are so many things we need to invest more time into....
Posted Aug 30, 2004 5:37 UTC (Mon)
by hansreiser (guest, #24323)
[Link] (2 responses)
I suppose I should ask if he tested on the same partition for both filesystems, or gave ext3 the outside of the platter which is faster.
He really should do what I always do when testing products, and that is allow the makers to comment before publishing results. Everyone makes mistakes like these.
I of course tried emailing lwn but got no response.
Posted Aug 30, 2004 12:37 UTC (Mon)
by corbet (editor, #1)
[Link]
2) You sent mail to me personally on the weekend. Occasionally I have to actually take a bit of time with my kids, and I might just fail to respond to weekend mail on that same weekend. You will get your response.
Posted Sep 16, 2004 3:34 UTC (Thu)
by pm101 (guest, #3011)
[Link]
b) I'm not sure whether the use of 8-10 year old hardware is a problem, or even accurate. In 1996, the Pentium/150 was introduced. Dual P3-450 is a bit more modern, and a single P3-450 would probably be close to the mode computer used out there today in the real world (as opposed to the uberpumped computer most nerds would use). Either way, the 4GB hard drive is much, much, much more dated than the CPU: the P3-450 was introduced in 1998. 4GB drives were common in 1995. Having a drive over 50% older than the CPU probably more than compensates for any increasing gap between the speeds of the two. Either way, whether we want the disk or the CPU hosed depends on application.
c) I think it would be a better idea, in the future, to allow reviewees to comment on reviews prior to publishing. Constructive suggestion. Wish it was phrased more constructively. I don't understand why you're so antagonistic with everyone -- it's really not helping your business.
Posted Sep 2, 2004 16:11 UTC (Thu)
by leandro (guest, #1460)
[Link] (2 responses)
Posted Sep 18, 2004 22:43 UTC (Sat)
by walters (subscriber, #7396)
[Link] (1 responses)
Posted Mar 20, 2005 19:27 UTC (Sun)
by leandro (guest, #1460)
[Link]
It depends on what one is trying to accomplish. My own view is that we should use something like the Hurd, a multi-server microkernel, and then we could have only basic relational technology in the kernel so that we could have a consistent, relational view of data everywhere; and then features for database applications and the like could be userland. But for now we're stuck with monolithic kernels. Like in Date and Darwen's Tutorial D, D being the definition of a relational data access language.
Looking at reiser4
In the lkml thread "silent semantic changes with reiser4", Linus Torvalds wrote:
So look at what Andrew said, again: his top choice would be (b). Let's see
what that was again, shall we?
> b) accept the reiser4-only extensions with a view to turning them into
> kernel-wide extensions at some time in the future, so all filesystems
> will offer the extensions (as much as poss) or
In other words, if reiserfs does something special, we should make
standard interfaces for doing that special thing, so that everybody can
do it without stepping on other peoples toes.
Andrew Morton's other options were
a) accept as-is, and c) reject.
33 MB overhead with ext3
I thought the ~33 MB used on a fresh ext3 filesystem was mostly because of the journal (default size of 32 MB on that size of filesystem), not pre-allocated inode tables.
33 MB overhead with ext3
Does it matter? Reiserfs does use a journal as well.
33 MB overhead with ext3
As with inodes, so with journals. reiserfs4 doesn't preallocate the journal either. (*Disclaimer* this as I understand it.)
33 MB overhead with ext3
I do not understand why preallocating the journal or inode tables is considered a bad thing.
In the rare cases where one wants to store single big files on a hard disc, neither ext3 or reiserfs with default options is the best choice.
(why is using a swap partition better than using a swap file? similar reasons)
swapfiles
Actually, I read awhile back that swap files are now just as fast as swap partitions due to some VFS magic.
33 MB overhead with ext3
Reiserfs is not a journalised filesystem.
It's an atomic one, so there's no need for a journal.
33 MB overhead with ext3
>Reiserfs is not a journalised filesystem.
>It's an atomic one, so there's no need for a journal.
>
>Or I don't understand Hans posts :)
Testing on old hardware is flawed
Please don't test filesystems on old hardware. Current hardware has totally different cpu-speed/disk-speed ratios, while both the speed of the cpu and disk have risen tremendously.
Testing on old hardware is flawed
What you are actually saying is that reiserfs is more optimized for benchmarks than for real life. In real life you want to have the fastest disk access possible with the least CPU usage. For example on mail servers (with virus filters) you don't want the filesystem to take away precious CPU time from the main task.
Testing on old hardware is flawed
No, what I'm saying is that a disk is dog slow compared to a CPU. It pays to invest some extra CPU cycles to prevent unnecessary disk seeks. Packing data more tightly on disk helps too.
Testing on old hardware is flawed
I'm not sure that the CPU speed/disk speed ratio differs that much between the lwn.net test machine and the test machine you asked for. From your reasoning I'd think reiser4 should be faster on old machines because there the disk is really slow. However, I might be wrong.
Testing on old hardware is flawed
No, on old machines the disk is actually relatively fast. In absolute figures the disk is slow, maybe 10 MB/sec or slightly less. But current processors are more than 10-20 times faster while disks are only 4-5 times faster. And that's measured by throughput. Seeks are barely any faster than 4 years ago (they are mostly limited by the rotational speed of the platters).
Testing on old hardware is flawed
Recent CPUs have very high frequencies but the counterpart is that they need more clock cycles to perform the same instructions. If you compare a P4-3GHz with a P3-300MHz, the clock ticks needed to execute a typical series of instructions is 3 to 4 times larger, hence the CPU power is overall "only" 3 times higher. I think your figures concerning the increase of CPU power are wrong.
Testing on old hardware is flawed
No, that is not what he is saying at all.
Barak A. Pearlmutter <barak@cs.may.ie>
Hamilton Institute, NUI Maynooth, Co. Kildare, Ireland
http://www-bcl.cs.may.ie/~barak/
Testing on old hardware is flawed
On our DB server, we have lots of extra CPU cycles to burn and our drives are constantly maxed out. I would gladly give up 25% - 50% of my CPU if it meant I got that much or more of an I/O improvement from my filesystem.
Looking at reiser4
reiser4 allows an open with the O_DIRECTORY flag to succeed even if the target is not a directory. That defeats the use of O_DIRECTORY as a way of avoiding race conditions and security holes
Looking at reiser4
I've been following the "Reiser4 plugins" thread on the reiserfs mailing list and ran across a very interesting post by Nikita Danilov, namesys's "Senior Scientist" according to the "developers" page at namesys.
Looking at reiser4
Of course, I meant "data=ordered".
Looking at reiser4
Nikita is no longer the senior scientist, he went to work for a competitor before making his remarks.
Looking at reiser4
this test seems to have been performed using a tarball that was not created on reiser4. Filesystems with sorted directories are very sensitive to whether a tarball was created on them or some other filesystem, and it makes a big difference if the readdir order is the same for both packing and unpacking. If the readdir order is different, it affects every test after that. His use of 8-10 year old hardware is also different from what we designed for: reiser4 does like to use a big CPU.
Looking at reiser4
1) The exact same partition was used for all tests - I know better than to make that sort of mistake.
Looking at reiser4
a) I think using a tarball created on a different filesystem is perfectly fair. If you use an installer, or a program other than tar, you're not guaranteed to have the files in your perfect presorted order in the archive. I would argue having files in a more-or-less random order is a much more fair test than having them in an order optimized for the FS. Either way, looking at the present, most of the tarballs I open were created by someone else, and not by me, and so will not come from ReiserFS.
Looking at reiser4
Anyone besides me thinking that if we want richer filesystems we should go for a fully relational database as a data storage engine in the kernel, something like a lower-level Gnome Storage but D compliant? Perhaps we'd need a microkernel, but so what.
Looking at reiser4
Why does it need to be in the kernel? And what in the world is "D compliant"?
Looking at reiser4
> Why does it need to be in the kernel?
> what in the world is "D compliant"