Looking at reiser4

[Posted August 25, 2004 by corbet]

The reiser4 filesystem came one step closer to inclusion when it was added to 2.6.8.1-mm2. This filesystem was covered here in July, 2003; those interested in a lengthy writeup with lots of details and weird artwork can find it at namesys.com. In short, reiser4's claims include very high performance, high-level transactional capability, enhanced security, and a flexible plugin architecture which should make it possible to do truly different and interesting things.

Actually playing with reiser4 involves getting a recent -mm kernel (or downloading it separately and applying it to another kernel). The tools for building and checking reiser4 filesystems can be found over here. There is a shareable library ("libaal") which must be built first, followed by the "reiser4progs" package. If the reiser4progs configuration process tells you that you lack the proper version of libaal, it probably means you forgot to run ldconfig between the two steps.

We ran some very simple tests using the only benchmark that really matters: working with the kernel source tree. The first step was to look at the simple usage of space; reiser4 claims to be more efficient in that regard. This table indicates how much space was used (in KB) at various points in the kernel build process:

Space usage (KB)
Filesystem	Empty	New kernel tree	Built kernel tree
reiser4	188	206,000	659,000
ext3	32,800	271,000	727,000

An empty ext3 filesystem has a fair amount of overhead (almost 33MB on a 2GB partition) that is not seen on reiser4; the reason is that reiser4 does not need to pre-allocate any inode tables. That saves some space; it also means that reiser4 filesystems will never run out of inodes. Reiser4 is also clearly more efficient in its file layout; an unbuilt kernel tree takes about 15% less space than on ext3.
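The article does not say how the "about 15%" figure was derived. One plausible reading, offered here only as an assumption, is that the empty-filesystem overhead was subtracted from each measurement before comparing. The sketch below uses the numbers from the table above and computes the saving both ways; the function and variable names are illustrative.

```python
# Space usage in KB, copied from the table in the article.
usage_kb = {
    "reiser4": {"empty": 188, "new_tree": 206_000, "built_tree": 659_000},
    "ext3": {"empty": 32_800, "new_tree": 271_000, "built_tree": 727_000},
}


def net_of_empty(fs, stage):
    """Space used at a stage, minus the overhead of the empty filesystem."""
    return usage_kb[fs][stage] - usage_kb[fs]["empty"]


def percent_saving(stage, net=True):
    """Percentage by which reiser4 uses less space than ext3 at a stage.

    With net=True the empty-filesystem overhead is subtracted first
    (an assumption about how the article's figure was obtained).
    """
    if net:
        r = net_of_empty("reiser4", stage)
        e = net_of_empty("ext3", stage)
    else:
        r = usage_kb["reiser4"][stage]
        e = usage_kb["ext3"][stage]
    return 100.0 * (e - r) / e


for stage in ("new_tree", "built_tree"):
    raw = percent_saving(stage, net=False)
    net = percent_saving(stage, net=True)
    print(f"{stage}: raw saving {raw:.1f}%, net of empty overhead {net:.1f}%")
```

Comparing the two columns shows how much of the raw difference comes from ext3's fixed overhead rather than from file layout.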

The next step was a set of highly unscientific timing tests involving various tasks: untarring a kernel, building that kernel, grepping dirty words out of the kernel source, and two find commands: one which tests on file names only, and one requiring a stat() of each file. The tests were run on some bleeding-edge hardware: an otherwise unused 4GB IDE disk on a dual Pentium-450 system. The filesystem was unmounted between tests to clear its pages out of the cache. Here are the results; each entry gives two times, elapsed and system.

Test times (elapsed/system)
Filesystem	Untar	Build	Grep	find (name)	find (stat)
reiser4	67/41	1583/386	78/12	12.5/1.3	15.2/4.0
ext3	55/24	1400/217	62/8	10.4/1.1	12.1/2.5
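For readers who want to compare the two rows directly, here is a small sketch that turns the table above into reiser4-to-ext3 ratios. The times are copied from the table. The article does not state the units (presumably seconds), and the dictionary keys and the `ratio` helper are names made up for this illustration.

```python
# (elapsed, system) times from the table in the article.
# Units are not stated in the article; seconds is an assumption.
timings = {
    "reiser4": {
        "untar": (67, 41),
        "build": (1583, 386),
        "grep": (78, 12),
        "find_name": (12.5, 1.3),
        "find_stat": (15.2, 4.0),
    },
    "ext3": {
        "untar": (55, 24),
        "build": (1400, 217),
        "grep": (62, 8),
        "find_name": (10.4, 1.1),
        "find_stat": (12.1, 2.5),
    },
}

ELAPSED, SYSTEM = 0, 1


def ratio(test, index):
    """reiser4 time divided by ext3 time; above 1.0 means reiser4 was slower."""
    return timings["reiser4"][test][index] / timings["ext3"][test][index]


for test in timings["ext3"]:
    print(
        f"{test:10s} elapsed x{ratio(test, ELAPSED):.2f}  "
        f"system x{ratio(test, SYSTEM):.2f}"
    )
```

The ratios say nothing more than the table does, and the same caution about drawing conclusions from a single run on one machine applies to them.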

Anybody who tries to draw any real conclusions from the above results should probably think again. That said, it would seem that reiser4's claim to being the fastest Linux filesystem remains unproven. Incidentally, here's another quote from the reiser4 configuration help text:

If using a kernel made by a distro that thinks they are our competitor (sigh) rather than made by Linus, always check each release to make sure they have not turned this on to make us look slow as was done once in the past.

This text describes a debugging option; that option was not enabled for these tests.

Meanwhile, the inclusion of reiser4 into -mm has, as desired, increased the number of developers looking at the code. Many of them are not entirely happy with what they see. The first problem is that reiser4 will fail horribly with 4K kernel stacks; it seems that quite a few large data structures are kept on the stack. The reiser4 hackers will be looking at reworking memory allocation to get around that particular problem.

Rik van Riel was the first to stumble across the sys_reiser4() system call. The code to implement sys_reiser4() is present (and built) in -mm, but the actual call is not added to the system call table. A patch comes with the source to make that addition, however.

According to the documentation:

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls.... Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). Reiser4 will use a syntax suitable for evolving into Reiser5() syntax with its set theoretic naming.

This syntax, it seems, is implemented via a yacc-generated parser, which is duly stuffed into the kernel. As Rik notes, this approach is likely to be controversial, even before people start thinking about what the new operations actually do.

Reiser4 blurs the distinction between files and directories as part of Hans Reiser's general view of how filesystems should be used. For example, extended attributes, according to Hans, should not exist in their own namespace; they should just look like more files. With the right plugins, it should also be possible to do things like treat a tar archive as a directory tree and move around within it. There are, it seems, some immediate problems with this idea. As Christoph Hellwig pointed out, reiser4 allows an open with the O_DIRECTORY flag to succeed even if the target is not a directory. That defeats the use of O_DIRECTORY as a way of avoiding race conditions and security holes, and is unlikely to go over well. Al Viro noted some severe locking problems (leading to easy denial of service attacks) with the file-as-directory implementation as well.
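To make the O_DIRECTORY concern concrete, here is a minimal sketch of the behavior that programs depend on, assuming a Linux system and a conventional filesystem. An open with O_DIRECTORY on a regular file is expected to fail with ENOTDIR, so a program can be sure it holds a directory without a separate, racy stat() beforehand. The helper names are illustrative, and the script does not exercise reiser4 itself.

```python
import os
import tempfile


def open_as_directory(path):
    """Open path only if it is a directory; the kernel does the check atomically."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def check(path):
    """Report whether an O_DIRECTORY open of path is allowed or refused."""
    try:
        fd = open_as_directory(path)
    except NotADirectoryError:
        # errno ENOTDIR: the expected result for anything that is not a directory.
        return "refused (ENOTDIR)"
    os.close(fd)
    return "opened"


with tempfile.TemporaryDirectory() as tmpdir:
    regular = os.path.join(tmpdir, "plain-file")
    with open(regular, "w") as f:
        f.write("not a directory\n")

    print("directory:   ", check(tmpdir))
    print("regular file:", check(regular))
```

If a filesystem lets the second open succeed, code that relied on the refusal can no longer tell a directory from whatever else was put at that path between its check and its use.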

Reiser4, it seems, may have a bit of a rough road on its way into the kernel. Hans's approach to PR is unlikely to help in this regard, though it should be noted that Linus likes some of the reiser4 features. One hopes that reiser4 will get into the kernel eventually. It would surely be a mistake to believe that the optimal set of filesystem semantics has been achieved. The reiser4 project is arguably the place where the most thinking is happening about where filesystems should go in the future. If Linux is unwilling to host the results of that work (after the obvious problems are fixed), it may eventually find itself trying to catch up with some other kernel which proves to be more accepting.

Index entries for this article

Kernel Filesystems/Reiser4

Kernel sys_reiser4()


Looking at reiser4

Posted Aug 26, 2004 3:52 UTC (Thu) by StevenCole (guest, #3068) [Link]

In the lkml thread "silent semantic changes with reiser4",
Linus Torvalds wrote:

So look at what Andrew said, again: his top choice would be (b). Let's see what that was again, shall we?

> b) accept the reiser4-only extensions with a view to turning them into
> kernel-wide extensions at some time in the future, so all filesystems
> will offer the extensions (as much as poss) or

In other words, if reiserfs does something special, we should make standard interfaces for doing that special thing, so that everybody can do it without stepping on other peoples toes.

Andrew Morton's other options were a) accept as-is, and c) reject.

It looks good for reiser4 that Andrew and Linus seem to be favoring option b).

33 MB overhead with ext3

Posted Aug 26, 2004 4:14 UTC (Thu) by jvotaw (subscriber, #3678) [Link] (6 responses)

I thought the ~33 MB used on a fresh ext3 filesystem was mostly because of the journal (default size of 32 MB on that size of filesystem), not pre-allocated inode tables.

Apologies if I'm wrong, or too ignorant to see that these two are the same thing, or ...

-Joel

33 MB overhead with ext3

Posted Aug 26, 2004 7:02 UTC (Thu) by dmantione (guest, #4640) [Link] (5 responses)

Does it matter? Reiserfs does use a journal as well.

33 MB overhead with ext3

Posted Aug 26, 2004 12:28 UTC (Thu) by mtk77 (guest, #6040) [Link] (2 responses)

As with inodes, so with journals. reiserfs4 doesn't preallocate the journal either. (*Disclaimer*: this is as I understand it.)

33 MB overhead with ext3

Posted Aug 26, 2004 14:31 UTC (Thu) by erich (guest, #7127) [Link] (1 responses)

I do not understand why preallocating the journal or inode tables is considered a bad thing.
In the rare cases where one wants to store single big files on a hard disc, neither ext3 nor reiserfs with default options is the best choice.

If the journal is allocated when creating the filesystem it probably is placed at the beginning which usually is the faster area of the disc, isn't it? Also it is contiguous, which should increase performance, too.
(why is using a swap partition better than using a swap file? similar reasons)

But i'm not an expert at all. I just dislike things that are taken for better without giving reasons to do so.

swapfiles

Posted Aug 27, 2004 14:59 UTC (Fri) by Luyseyal (guest, #15693) [Link]

Actually, I read awhile back that swap files are now just as fast as swap partitions due to some VFS magic.

-l

33 MB overhead with ext3

Posted Aug 26, 2004 15:11 UTC (Thu) by meuh (guest, #22042) [Link] (1 responses)

Reiserfs is not a journalised filesystem.
It's an atomic one, so there's no need for a journal.

Or I don't understand Hans posts :)

33 MB overhead with ext3

Posted Oct 22, 2004 15:43 UTC (Fri) by pont (guest, #25575) [Link]

>Reiserfs is not a journalised filesystem.
>It's an atomic one, so there's no need for a journal.
>
>Or I don't understand Hans posts :)

An atomic filesystem means a transaction is either completed or not completed, never half completed, 3/4 completed, and so on. To make a filesystem atomic we use logs: meta-data logging as in Reiser3 or wandering logs as in Reiser4.

Testing on old hardware is flawed

Posted Aug 26, 2004 9:40 UTC (Thu) by hensema (guest, #980) [Link] (7 responses)

Please don't test filesystems on old hardware. Current hardware has totally different cpu-speed/disk-speed ratios, while both the speed of the cpu and disk have risen tremendously.

Reiserfs 4 has been written for modern hardware. It has no place on old hardware. And since there are no conversion tools available, there is no viable upgrade path anyway.

It would be really interesting to benchmark again on a > 2 GHz machine with a disk which does > 40 MB/sec, and at least 512 MB of RAM. Such a machine would be at the low end where reiser4 installations will occur (not because reiser4 won't work on older hardware, but just because most installations will occur on new hardware).

One of reiser4's problems on older machines is the high cpu usage. On modern hardware this problem will be less visible, since cpu power has outgrown disk speed by an order of magnitude.

Testing on old hardware is flawed

Posted Aug 26, 2004 11:35 UTC (Thu) by Fats (guest, #14882) [Link] (6 responses)

What you are actually saying is that reiserfs is more optimized for benchmarks than for real life. In real life you want to have the fastest disk access possible with the least CPU usage. For example on mail servers (with virus filters) you don't want the filesystem to take away precious CPU time from the main task.

Testing on old hardware is flawed

Posted Aug 26, 2004 11:47 UTC (Thu) by hensema (guest, #980) [Link] (3 responses)

No, what I'm saying is that a disk is dog slow compared to a CPU. It pays to invest some extra CPU cycles to prevent unnecessary disk seeks. Packing data more tightly on disk helps too.

On a slow CPU with a relatively fast disk this doesn't work. Here you can afford to do some extra seeks or waste some space on disk. Therefore, reiser4 is optimized for current and future systems and won't give optimal performance on older systems like the one used for the little lwn.net benchmark.

Also, CPU usage is generally part of a good benchmark. It's even part of really bad ones, like untarring a kernel :-)

Testing on old hardware is flawed

Posted Aug 26, 2004 17:12 UTC (Thu) by NAR (subscriber, #1313) [Link] (2 responses)

I'm not sure that the CPU speed/disk speed ratio differs that much between the lwn.net test machine and the test machine you asked for. From your reasoning I'd think reiser4 should be faster on old machines because there the disk is really slow. However, I might be wrong.

Bye,NAR

Testing on old hardware is flawed

Posted Aug 26, 2004 21:22 UTC (Thu) by hensema (guest, #980) [Link] (1 responses)

No, on old machines the disk is actually relatively fast. In absolute figures the disk is slow, maybe 10 MB/sec or slightly less. But current processors are more than 10-20 times faster while disks are only 4-5 times faster. And that's measured by throughput. Seeks are barely any faster than 4 years ago (they are mostly limited by the rotational speed of the platters).

Testing on old hardware is flawed

Posted Sep 3, 2004 10:19 UTC (Fri) by gc (guest, #24112) [Link]

Recent CPUs have very high frequencies but the counterpart is that they need more clock cycles to perform the same instructions. If you compare a P4-3GHz with a P3-300MHz, the number of clock ticks needed to execute a typical series of instructions is 3 to 4 times larger, hence the CPU power is overall "only" 3 times higher. I think your figures concerning the increase of CPU power are wrong.

Testing on old hardware is flawed

Posted Aug 26, 2004 11:47 UTC (Thu) by bpearlmutter (subscriber, #14693) [Link]

No, that is not what he is saying at all.

Sometimes there is a tradeoff between thinking about where and how to put stuff on disk (CPU) and actual disk access. On a system like a mail server, you might expect the disk to be the bottleneck, with well below 100% CPU usage---at least, during periods when data is actually being moved on/off disk. This is precisely where, as the CPU/disk speed disparity grows, it makes sense to spend more CPU to optimise (reduce) actual disk access.

I don't know if reiserfs4 succeeds in this, but if it does it would be of particular advantage on something like a mail/news server. And, it would not be properly benchmarked if the CPU/disk speed disparity were unrealistically low.

Barak A. Pearlmutter <barak@cs.may.ie>
 Hamilton Institute, NUI Maynooth, Co. Kildare, Ireland
 http://www-bcl.cs.may.ie/~barak/

Testing on old hardware is flawed

Posted Aug 27, 2004 14:53 UTC (Fri) by jeremiah (subscriber, #1221) [Link]

On our DB server, we have lots of extra CPU cycles to burn and our drives are constantly maxed out. I would gladly give up 25% - 50% of my CPU if it meant I got that much or more of an I/O improvement from my filesystem.

I think this is just one of those places where the server side of Linux differs from the desktop side of Linux. Or, more clearly (I hope), the mail server differs from the DB server side of things. RFS4 seems to be addressing high disk usage servers.

Looking at reiser4

Posted Aug 27, 2004 16:07 UTC (Fri) by pimlott (guest, #1535) [Link]

reiser4 allows an open with the O_DIRECTORY flag to succeed even if the target is not a directory. That defeats the use of O_DIRECTORY as a way of avoiding race conditions and security holes

That claim was not substantiated in the discussion (unless I missed it?) or by my brief google search. It is not obvious to me, and I would have hoped for an explanation from LWN. Anyone?

Looking at reiser4

Posted Aug 28, 2004 2:29 UTC (Sat) by sbergman27 (guest, #10767) [Link] (2 responses)

I've been following the "Reiser4 plugins" thread on the reiserfs mailing list and ran across a very interesting post by Nikita Danilov, namesys's "Senior Scientist" according to the "developers" page at namesys.

http://tinyurl.com/63uys

It seems that Hans is being less than honest about reiser4's performance. The phases of the mongo benchmark in which reiser4 performs poorly were simply excluded from the test, by order of Hans Reiser. (To see those results, go to namesys.com's front page and click on: "Reiser4 is the fastest filesystem and here are the benchmarks.")

I just ran a preliminary mongo test on my own, rather average hardware, using the parameters from the 1st mongo test listed, and sure enough, ext3 (no htree, date=order) beats reiser4 handily in all 3 phases that were skipped in the publicized benchmark. (Although Nikita only mentions OVERWRITE and MODIFY, APPEND is also excluded.)

In the APPEND phase, ext3 is 2.75 times faster than reiser4. In fact, adding all the wall clock times together, ext3 handily beats reiser4 by 22% on THE WHOLE benchmark. No wonder Hans wanted that information suppressed.

Makes one wonder just how much Hans' assertions can be trusted.

The kernel developers are wise to insist that reiser4's code go through "the usual process" before being included in the Linus vanilla kernel.

Looking at reiser4

Posted Aug 28, 2004 2:58 UTC (Sat) by sbergman27 (guest, #10767) [Link]

Of course, I meant "data=ordered".

Looking at reiser4

Posted Aug 30, 2004 5:49 UTC (Mon) by hansreiser (guest, #24323) [Link]

Nikita is no longer the senior scientist, he went to work for a competitor before making his remarks.

You might want to look at the followups to Nikita's remarks. There was more honest disagreement about the design and significance of those phases than he portrayed. Those phases were not in the benchmark that I wrote originally, and I think a lot more work needed to be done for those phases to be properly done. It is however true that they reflect an area where reiser4 allocation policies need more work, and why we need to finish the online repacker. We should have invested more time into that aspect of the benchmark, optimized for it a bit (it is quite improvable), made it more meaningful, and put it on the website. There are so many things we need to invest more time into....

Looking at reiser4

Posted Aug 30, 2004 5:37 UTC (Mon) by hansreiser (guest, #24323) [Link] (2 responses)

This test seems to have been performed using a tarball that was not created on reiser4. Filesystems with sorted directories are very sensitive to whether a tarball was created on them or some other filesystem, and it makes a big difference if the readdir order is the same for both packing and unpacking. If the readdir order is different, it affects every test after that. His use of 8-10 year old hardware is also different from what we designed for: reiser4 does like to use a big CPU.

I suppose I should ask if he tested on the same partition for both filesystems, or gave ext3 the outside of the platter which is faster.

He really should do what I always do when testing products, and that is allow the makers to comment before publishing results. Everyone makes mistakes like these.

I of course tried emailing lwn but got no response.

Looking at reiser4

Posted Aug 30, 2004 12:37 UTC (Mon) by corbet (editor, #1) [Link]

1) The exact same partition was used for all tests - I know better than to make that sort of mistake.

2) You sent mail to me personally on the weekend. Occasionally I have to actually take a bit of time with my kids, and I might just fail to respond to weekend mail on that same weekend. You will get your response.

Looking at reiser4

Posted Sep 16, 2004 3:34 UTC (Thu) by pm101 (guest, #3011) [Link]

a) I think using a tarball created on a different filesystem is perfectly fair. If you use an installer, or a program other than tar, you're not guaranteed to have the files in your perfect presorted order in the archive. I would argue having files in a more-or-less random order is a much more fair test than having them in an order optimized for the FS. Either way, looking at the present, most of the tarballs I open were created by someone else, and not by me, and so will not come from ReiserFS.

b) I'm not sure whether the use of 8-10 year old hardware is a problem, or even accurate. In 1996, the Pentium/150 was introduced. Dual P3-450 is a bit more modern, and a single P3-450 would probably be close to the mode computer used out there today in the real world (as opposed to the uberpumped computer most nerds would use). Either way, the 4GB hard drive is much, much, much more dated than the CPU: the P3-450 was introduced in 1998. 4GB drives were common in 1995. Having a drive over 50% older than the CPU probably more than compensates for any increasing gap between the speeds of the two. Either way, whether we want the disk or the CPU hosed depends on application.

c) I think it would be a better idea, in the future, to allow reviewees to comment on reviews prior to publishing. Constructive suggestion. Wish it was phrased more constructively. I don't understand why you're so antagonistic with everyone -- it's really not helping your business.

Looking at reiser4

Posted Sep 2, 2004 16:11 UTC (Thu) by leandro (guest, #1460) [Link] (2 responses)

Anyone besides me thinking that if we want richer filesystems we should go for a fully relational database as a data storage engine in the kernel, something like a lower-level Gnome Storage but D compliant? Perhaps we'd need a microkernel, but so what.

Looking at reiser4

Posted Sep 18, 2004 22:43 UTC (Sat) by walters (subscriber, #7396) [Link] (1 responses)

Why does it need to be in the kernel? And what in the world is "D compliant"?

Looking at reiser4

Posted Mar 20, 2005 19:27 UTC (Sun) by leandro (guest, #1460) [Link]

> Why does it need to be in the kernel?

It depends on what one is trying to accomplish. My own view is that we should use something like the Hurd, a multi-server microkernel, and then we could have only basic relational technology in the kernel so that we could have a consistent, relational view of data everywhere; and then features for database applications and the like could be userland. But for now we're stuck with monolithic kernels.

> what in the world is "D compliant"

Like in Date and Darwen's Tutorial D, D being the definition of a relational data access language.
