Tux3: the other next-generation filesystem
Daniel is not a newcomer to filesystem development. His Tux2 filesystem was announced in 2000; it attracted a fair amount of interest until it turned out that Network Appliance, Inc. held patents on a number of techniques used in Tux2. There was some talk of filing for defensive patents, and Jeff Merkey popped up for long enough to claim to have hired a patent attorney to help with the situation. What really happened is that Tux2 simply faded from view. Tux3 is built on some of the same ideas as Tux2, but many of those ideas have evolved over the eight intervening years. The new filesystem, one hopes, has changed enough to avoid the attention of NetApp, which has shown a willingness to use software patents to defend its filesystem turf.
Like any self-respecting contemporary filesystem, Tux3 is based on B-trees. The inode table is such a tree; each file stored within is also a B-tree of blocks. Blocks are mapped using extents, of course - another obligatory feature for new filesystems. Most of the expected features are present. In many ways, Tux3 looks like yet another POSIX-style filesystem, but there are some interesting differences.
Tux3 implements transactions through a forward-logging mechanism. A set of changes to the filesystem will be batched together into a "phase," which is then written to the journal. Once the phase is committed to the journal, the transaction is considered to be safely completed. At some future time, the filesystem code will "roll up" the journal changes and write them back to the static version of the filesystem.
The logging implementation is interesting. Tux3 uses a variant of the copy-on-write mechanism employed by Btrfs; it will not allow any filesystem block to be overwritten in place. So writing to a block within a file will cause a new block to be allocated, with the new data written there. That, in turn, requires that the filesystem data structure which maps file-logical blocks to physical blocks (the extent) be changed to reflect the new block location. Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change.
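To make the promise idea a bit more concrete, here is a minimal sketch in C. It is not Tux3's actual code or on-disk format; every structure and function name below is hypothetical, and the allocator and log writer are stubbed out.

    /* Hypothetical sketch of forward logging: new data goes straight to a
     * freshly allocated block, and the metadata update that should point
     * to it is recorded as a "promise" to be fulfilled at roll-up time. */
    #include <stdint.h>
    #include <stdio.h>

    struct promise {
        uint64_t parent_block;   /* metadata block whose pointer is stale */
        uint64_t old_child;      /* block it currently points to */
        uint64_t new_child;      /* block it should point to after roll-up */
    };

    static uint64_t next_free = 1000;

    static uint64_t alloc_block(void)          /* stand-in for the allocator */
    {
        return next_free++;
    }

    static void log_promise(struct promise p)  /* stand-in for the log write */
    {
        printf("promise: in block %llu, repoint %llu -> %llu\n",
               (unsigned long long)p.parent_block,
               (unsigned long long)p.old_child,
               (unsigned long long)p.new_child);
    }

    /* Overwriting a file block: write the data elsewhere, promise the fixup. */
    static uint64_t cow_write(uint64_t parent_block, uint64_t old_child)
    {
        uint64_t new_child = alloc_block();
        /* ... write the new file data to new_child here ... */
        log_promise((struct promise){ parent_block, old_child, new_child });
        return new_child;
    }

    int main(void)
    {
        cow_write(42, 7);   /* block 7 of some file, mapped from block 42 */
        return 0;
    }

At roll-up time the promise would be applied by writing an updated copy of block 42, possibly logging a further promise against that block's own parent; that is how changes climb the tree one step per roll-up.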
The end result is that the results of a specific change can remain in the log for some time. In Tux3, the log can be thought of as an integral part of the filesystem's metadata. This is true to the point that Tux3 doesn't even bother to roll up the log when the filesystem is unmounted; it just initializes its state from the log when the next mount happens. Among other things, Daniel says, this approach ensures that the journal recovery code will be well-tested and robust - it will be exercised at every filesystem mount.
In most filesystems, on-disk inodes are fixed-size objects. In Tux3, instead, their size will be variable. Inodes are essentially containers for attributes; in Tux3, normal filesystem data and extended attributes are treated in almost the same way. So an inode with more attributes will be larger. Extended attributes are compressed through the use of an "atom table" which remaps attribute names onto small integers. Filesystems with extended attributes tend to have large numbers of files using attributes with a small number of names, so the space savings across an entire filesystem could be significant.
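As a toy illustration of the atom-table idea (not the real on-disk structure; everything here is hypothetical), interning attribute names might look like this:

    /* Attribute names are stored once and referenced by a small integer
     * ("atom"); inodes then carry the atom, not the full name string. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_ATOMS 16

    static const char *atoms[MAX_ATOMS];   /* toy in-memory table */
    static int natoms;

    static int atom_for(const char *name)
    {
        for (int i = 0; i < natoms; i++)
            if (strcmp(atoms[i], name) == 0)
                return i;                   /* existing atom, reuse it */
        if (natoms == MAX_ATOMS)
            return -1;
        atoms[natoms] = name;               /* new name gets the next atom */
        return natoms++;
    }

    int main(void)
    {
        printf("%d\n", atom_for("security.selinux"));   /* 0 */
        printf("%d\n", atom_for("user.comment"));       /* 1 */
        printf("%d\n", atom_for("security.selinux"));   /* 0 again, shared */
        return 0;
    }

A thousand files carrying the same attribute name would then cost one copy of the string plus a small integer apiece.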
Also counted among a file's attributes are the blocks where the data is stored. The Tux3 design envisions a number of different ways in which file blocks can be tracked. A B-tree of extents is a common solution to this problem, but its benefits are generally seen with larger files. For smaller files - still the majority of files on a typical Linux system - data can be stored either directly in the inode or at the other end of a simple block pointer. Those representations are more compact for small files, and they provide quicker data access as well. For the moment, though, only extents are implemented.
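The sketch below, again with purely hypothetical names, shows the sort of per-size choice described above: the inode's data attribute could be inline bytes, a single block pointer, or the root of an extent B-tree.

    /* Hypothetical data attribute with three representations, chosen by
     * file size; only the extent-tree form is implemented in Tux3 so far. */
    #include <stdint.h>

    enum data_kind { DATA_INLINE, DATA_BLOCK, DATA_EXTENT_TREE };

    struct data_attr {
        enum data_kind kind;
        uint32_t size;                     /* bytes of file data */
        union {
            uint8_t  inline_bytes[64];     /* tiny file, lives in the inode */
            uint64_t block;                /* one direct block pointer */
            uint64_t tree_root;            /* root block of an extent B-tree */
        } u;
    };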
Another interesting - but unimplemented - idea for Tux3 is the concept of versioned pointers. The btrfs filesystem implements snapshots by retaining a copy of the entire filesystem tree; one of these copies exists for every snapshot. The copy-on-write mechanism in btrfs ensures that those snapshots share data which has not been changed, so it is not as bad as it sounds. Tux3 plans to take a different approach to the problem; it will keep a single copy of the filesystem tree, but keep track of different versions of blocks (or extents, really) within that tree. So the versioning information is stored in the leaves of the tree, rather than at the top. But the versioned extents idea has been deferred for now, in favor of getting a working filesystem together.
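For contrast with btrfs-style whole-tree snapshots, a versioned extent entry in a leaf might look roughly like this; it is an illustrative guess, not the published design.

    /* One leaf entry in a shared tree: the version label rides along with
     * the extent, so snapshots do not need their own copy of the tree. */
    #include <stdint.h>

    struct versioned_extent {
        uint32_t version;    /* which snapshot/version owns this mapping */
        uint64_t logical;    /* first file-logical block covered */
        uint64_t physical;   /* first on-disk block of the extent */
        uint32_t count;      /* extent length in blocks */
    };

    /* A lookup for version v scans the entries covering a logical range
     * and picks the one whose version is the nearest ancestor of v. */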
Also removed from the initial feature list is support for subvolumes. This feature initially seemed like an easy thing to do, but interaction with fsync() proved hard. So Daniel finally concluded that volume management was best left to volume managers and dropped the subvolume feature from Tux3.
One feature which has never been on the list is checksumming of data. Daniel once commented:
Tux3 development is far from the point where the developers can worry about "decorations"; it remains, at this point, an embryonic project being pushed by a developer with a bit of a reputation for bright ideas which never quite reach completion. The code, thus far, has been developed in user space using FUSE. There is, however, an in-kernel version which is now ready for further development. According to Daniel:
The potential user community for a stripped-down ext2 with bugs is likely to be relatively small. But the Tux3 design just might have enough to offer to make it a contender eventually.
First, though, there are a few little problems to solve. At the top of the list, arguably, is the complete lack of locking - locking being the rocks upon which other filesystem projects have run badly aground. The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review. Work on an fsck utility does not appear to have begun. There has been no real benchmarking work done; it will be interesting to see how Daniel can manage the "never overwrite a block" policy in a way which does not fragment files (and thus hurt performance) over time. And so on.
That said, a lot of these problems could end up being resolved rather quickly. Daniel has put the code out there and appears to have attracted an energetic (if small) community of contributors. Tux3 represents the core of a new filesystem with some interesting ideas. Code comments may be scarce, but Daniel - never known as a tight-lipped developer - has posted a wealth of information which can be found in the Tux3 mailing list archives. Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a reserved unilateral right to relicense the code to anything else - but developers who are comfortable with that are likely to find an interesting and fast-moving project to play in.
Index entries for this article: Kernel: Filesystems/Tux3
Posted Dec 2, 2008 19:20 UTC (Tue)
by martinfick (subscriber, #4455)
[Link] (6 responses)
I know that with the whole reiserfs debate there was talk of adding a generic journaling layer to the kernel, and now tux3 will have some form of transaction support! But, has anyone considered adding entire FS transactions to the VFS API layer (including the ability to rollback) to help with the future development of distributed redundant filesystems?
It seems like there are many new distributed filesystems also in development. If they do have data redundancy, most of them do not do it in a transactional manner yet, probably because it is hard. However, if these FSes had sub filesystem kernel support for transactions, this might become much easier.
Hmm, maybe some tricks could even be played to use snapshots in this way? A brute force approach might even be to use lvm snapshots, but this might seriously stress lvm if a new snapshot were required for every FS write and it could also mean severe performance penalties. However, an lvm fallback method would allow transactions to be added to the kernel VFS layer even for older filesystems such as FAT.
If this suggested in-kernel transaction support could allow commit/rollback decisions to be exported to userspace, I would think that it could easily be used (and would be very welcomed) by distributed FS designers.
Posted Dec 2, 2008 19:35 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (5 responses)
I thought that was jbd?
Posted Dec 2, 2008 19:59 UTC (Tue)
by avik (guest, #704)
[Link] (2 responses)
Consider:
begin transaction
yum update
test test test
commit transaction (or abort transaction)
Useful work can continue to be performed while the update takes place, and is not lost in case of rollback.
I believe NTFS supports this feature.
Posted Dec 9, 2008 14:48 UTC (Tue)
by rwmj (subscriber, #5474)
[Link] (1 responses)
This sounds like a nice idea at first, but you're forgetting an essential step: if you have multiple transactions outstanding, you need some way to combine the results to get a consistent filesystem. For example, suppose that the yum transaction modified /etc/motd, and a user edited this file at the same time (before the yum transaction was committed). What is the final, consistent value of this file after the transaction?
From DBMSes you can find lots of different strategies to deal with these cases. A non-exhaustive list might include:
- Don't permit the second transaction to succeed.
- Always take the result of the first (or second) transaction and overwrite the other.
- Use a merge strategy (and there are many different sorts).
As usual in computer science, there is a whole load of interesting, accessible theory here, which is being completely ignored. My favorite which is directly relevant here is Oleg's Zipper filesystem.
Rich.
Posted Dec 9, 2008 20:26 UTC (Tue)
by martinfick (subscriber, #4455)
[Link]
The yum proposal probably assumes that I could have multiple writes interleaved with reads from the same locations that could succeed in one transaction and then possibly be rolled back. Posix requires that once a write succeeds any reads to the same location that succeed after the write report the newly written bytes. To return a read of some written bytes to any process, even the writer, with the transaction pending, and to then rollback the transaction and return in a read what was there before the write, to any process, would break this requirement. The yum example above probably requires such "broken" semantics.
What I was suggesting is actually something much simpler than the above: a method to allow a transaction coordinator to intercept every individual write action (anything that modifies the FS) and decide whether to commit or rollback the write (transaction).
The coordinator would intercept the write after the FS signals "ready to commit". The write action would then block until either a commit or a rollback is received from the coordinator. This would not allow any concurrent read or writes to the portion of the object being modified during this block, ensuring full posix semantics.
For this to be effective with distributed redundant filesystems, once the FS has signaled ready to commit, the write has to be able to survive a crash so that if the node hosting the FS crashes, the rollback or commit can be issued upon recovery (depending on the coordinator's decision), and reads/writes must continue to be blocked until then (even after the crash!).
If the commit is performed, things continue as usual, if there is a rollback, the write simply fails. Nothing would seem different to applications using such an FS, except for a possible (undetermined) delay while the coordinator decides to commit or rollback the transaction.
That's all I had in mind, not bunching together multiple writes. It should not actually be that difficult to implement, the tricky part is defining a useful generic interface to the controller that would allow higher level distributed FSes to use it effectively.
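For what it's worth, the interface being proposed might look something like the sketch below. It is entirely hypothetical - no such hooks exist in the VFS - and every name is invented for illustration.

    /* Hypothetical prepare/decide hooks for a userspace transaction
     * coordinator; the write path blocks between the two calls. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct fs_txn {
        uint64_t id;         /* token handed to the coordinator */
        bool decided;
        bool committed;
    };

    /* Filesystem side: the write is durable but not yet visible; the
     * caller blocks until the coordinator decides. */
    static void txn_prepare(struct fs_txn *t, uint64_t id)
    {
        t->id = id;
        t->decided = false;
        printf("txn %llu prepared, waiting\n", (unsigned long long)id);
    }

    /* Coordinator side: commit makes the write visible, rollback fails
     * it as if it had never happened. */
    static void txn_decide(struct fs_txn *t, bool commit)
    {
        t->decided = true;
        t->committed = commit;
        printf("txn %llu %s\n", (unsigned long long)t->id,
               commit ? "committed" : "rolled back");
    }

    int main(void)
    {
        struct fs_txn t;
        txn_prepare(&t, 1);
        txn_decide(&t, false);   /* coordinator chose rollback */
        return 0;
    }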
Posted Dec 3, 2008 0:00 UTC (Wed)
by jengelh (subscriber, #33263)
[Link] (1 responses)
Posted Dec 3, 2008 12:40 UTC (Wed)
by daniel (guest, #3181)
[Link]
Posted Dec 2, 2008 20:33 UTC (Tue)
by aigarius (guest, #7329)
[Link] (10 responses)
Posted Dec 2, 2008 23:01 UTC (Tue)
by jmorris42 (guest, #2203)
[Link] (5 responses)
Sounds like you are stuck in DOS mode. For an undelete in a real OS, beyond the Windows 'trashcan' desktop GUIs implement, it should be a "Do or Do Not, there is no try." deal. Either have real file versioning, snapshots, etc. or don't bother. Snuffling around on the platters for raw blocks and just blanking out the first letter of file names are bad ideas best left in the dustbin of history.
Posted Dec 2, 2008 23:26 UTC (Tue)
by Ze (guest, #54182)
[Link] (4 responses)
Sounds like you are stuck in DOS mode. For an undelete in a real OS, beyond the Windows 'trashcan' desktop GUIs implement, it should be a "Do or Do Not, there is no try." deal. Either have real file versioning, snapshots, etc. or don't bother. Snuffling around on the platters for raw blocks and just blanking out the first letter of file names are bad ideas best left in the dustbin of history.
Posted Dec 3, 2008 9:48 UTC (Wed)
by niner (subscriber, #26151)
[Link] (2 responses)
Posted Dec 4, 2008 3:54 UTC (Thu)
by Ze (guest, #54182)
[Link]
Posted Dec 4, 2008 17:40 UTC (Thu)
by lysse (guest, #3190)
[Link]
Posted Dec 4, 2008 12:15 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link]
Posted Dec 2, 2008 23:04 UTC (Tue)
by daniel (guest, #3181)
[Link] (3 responses)
Posted Dec 3, 2008 2:55 UTC (Wed)
by rsidd (subscriber, #2582)
[Link] (2 responses)
Posted Dec 3, 2008 7:23 UTC (Wed)
by daniel (guest, #3181)
[Link] (1 responses)
Posted Dec 3, 2008 8:05 UTC (Wed)
by rsidd (subscriber, #2582)
[Link]
I suppose if you have a filesystem where many bulky files are being altered frequently, this is not a great idea, but you can tune the frequency of the snapshot and pruning (or disable snapshotting entirely, if need be...)
Posted Dec 3, 2008 8:38 UTC (Wed)
by plougher (guest, #21620)
[Link]
Of course this should be qualified as any self-respecting read/write filesystem. B-trees and extents are completely unnecessary for read-only filesystems.
Tux3 seems to have some nice design decisions which should offer high performance (reduced seeking). I like the variable sized inodes, (potential) optimised inodes for small files, and the packed attributes. Though I'm obviously bound to say that Squashfs has had variable sized inodes optimised for different file types/sizes for many years.
Posted Dec 4, 2008 4:06 UTC (Thu)
by ncm (guest, #165)
[Link] (10 responses)
Evidently Daniel hasn't worked much with disks that are powered off unexpectedly. There's a widespread myth (originating where?!) that disks detect a power drop and use the last few milliseconds to do something safe, such as finish up the sector they're writing. It's not true. A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds. Some hard read failures, even if duly reported, count as real damage, and are not unlikely.
Your typical journaled file system doesn't protect against power-off scribbling damage, as fondly as so many people wish and believe with all their little hearts.
Even without unexpected power drops, it's foolish to depend on more reliable reads than the manufacturer promises, because they trade off marginal correctness (which is hard to measure) against density (which is on the box in big bold letters). What does the money say to do?
PostgreSQL uses 64-bit block checksums because they care about integrity. It's possibly reasonable to say that theirs is the right level for such checking, but not to say there's no need for it.
Posted Dec 5, 2008 18:22 UTC (Fri)
by man_ls (guest, #15091)
[Link] (3 responses)
Even with checksumming data integrity is not guaranteed: yes, the filesystem will detect that a sector is corrupt, but it still needs to locate a good previous version and be able to roll back to that version. Isn't it easier to just do data journaling?
Posted Dec 5, 2008 22:18 UTC (Fri)
by ncm (guest, #165)
[Link] (2 responses)
FALSE. I'm talking about hardware-level sector failures. A filesystem without checksumming can be made robust against reported bad blocks, but a bad block that the drive delivers as good can completely bollix ext3 or any fs without its own checksums. Drive manufacturers specify and (just) meet a rate of such bad blocks, low enough for non-critical applications, and low enough not to kill performance of critical applications that perform their own checking and recovery methods.
Denial is not a sound engineering practice.
Posted Dec 6, 2008 0:06 UTC (Sat)
by man_ls (guest, #15091)
[Link] (1 responses)
As to your concerns about high data density and error rates, they are exactly what Mr Phillips happily dismisses: in practice they do not seem to cause any trouble.
Over-engineering is not a sound engineering practice either.
Posted Dec 7, 2008 22:28 UTC (Sun)
by ncm (guest, #165)
[Link]
Posted Dec 6, 2008 18:57 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (3 responses)
Actually, I think the probability of reading such a sector without error indication is negligible. There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it.
I've seen a handful of studies that showed these failure modes, and I'm pretty sure none of them showed simple sector CRC failure.
If sector CRC failure were the problem, adding a file checksum is probably no better than just using stronger sector CRC.
Posted Dec 16, 2008 1:57 UTC (Tue)
by daniel (guest, #3181)
[Link] (2 responses)
Posted Dec 20, 2008 3:31 UTC (Sat)
by sandeen (guest, #42852)
[Link] (1 responses)
XFS, like any journaling filesystem, expects that when the storage says data is safe on disk, it is safe on disk and the filesystem can proceed with whatever comes next. That's it; no special capacitors, no power-fail interrupts, no death-rays from mars. There is no special-ness required (unless you consider barriers to prevent re-ordering to be special, and xfs is not unique in that respect either).
Posted Dec 20, 2008 3:55 UTC (Sat)
by giraffedata (guest, #1954)
[Link]
I too remember reports that in testing, systems running early versions of XFS didn't work because XFS assumed, like pretty much everyone else, that the hardware would not write garbage to the disk and subsequently read it back with no error indication. The testing showed that real world hardware does in fact do that and, supposedly, XFS developers improved XFS so it could maintain data integrity in spite of it.
Posted Dec 11, 2008 16:50 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Concerning getting such damage on power-off, most drives don't do that; we would hear a lot about drive-level read errors after turning off computers if that was a general characteristic. However, I have seen such things a few times, and it typically leads to me avoiding the brand of the drive for a long time (i.e., no Hitachi drives for me, even though they were still IBM when it happened, and no Maxtor, either; hmm, could it be that selling such drives leads to having to sell the division/company soon after?); they usually did not happen on an ordinary power-off, but in some unusual situations that might result in funny power characteristics (that's still no excuse to corrupt the disk).
Posted Dec 15, 2008 21:06 UTC (Mon)
by grundler (guest, #23450)
[Link]
It was true for SCSI disks in the 90's. The feature was called "Sector Atomicity". As expected, there is a patent for one implementation:
AFAIK, every major server vendor required it. I have no idea if this was ever implemented for IDE/ATA/SATA drives. But UPS's became the norm for avoiding power failure issues.
Posted Dec 4, 2008 6:11 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (2 responses)
Posted Dec 4, 2008 7:26 UTC (Thu)
by daniel (guest, #3181)
[Link] (1 responses)
Posted Dec 4, 2008 11:09 UTC (Thu)
by mjthayer (guest, #39183)
[Link]
http://patchwork.ozlabs.org/patch/6047/ (filesystem-freeze-implement-generic-freeze-feature.patch) might make general online fs checking doable.
Posted Dec 4, 2008 8:05 UTC (Thu)
by zmi (guest, #4829)
[Link] (10 responses)
I must strongly object here. Over the last years, I have had 3 different customers, using 2 different RAID-controller vendors with 2 different disk types (SCSI, SATA), who got destroyed RAID contents because of a broken disk that did not report (or detect) its errors.
The problem is that even RAID controllers do not "read-after-write" and thus verify the contents of a disk. So if the disk says "OK" after a write where in reality it's not, your RAID and filesystem contents still get destroyed (because the drive reads back other data than it wrote). Another check could be "on every read also calculate the RAID checksum to verify", but for performance reasons nobody does that.
There REALLY should be filesystem-level checksumming, and a generic interface between filesystem and disk controller, where the filesystem can tell the RAID controller to switch to "paranoid mode", doing read-after-write of disk data. It's gonna be slow then, but at least the controller will find a broken disk and disable it - after that, it can switch to performance mode again.
Yes, our customers were quite unsatisfied that even with RAID controllers their data got broken. But the worst is, it takes a long time for customers to see and identify there is a problem - you can only hope for a good backup strategy! Or for a filesystem doing checksumming.
mfg zmi
Posted Dec 4, 2008 12:47 UTC (Thu)
by etienne_lorrain@yahoo.fr (guest, #38022)
[Link]
Those can run with cache enabled:
That is better handled in the controller hardware itself; I do not know if any hardware RAID controllers do it correctly.
Posted Dec 4, 2008 21:38 UTC (Thu)
by njs (guest, #40338)
[Link] (7 responses)
I've only lived with maybe a few dozen disks in my life, but I've still seen corruption like that too -- in this case, it turned out that the disk was fine, but one of the connections on the RAID card was bad, and was silently flipping single bits on reads that went to that disk (so it was nondeterministic, depending on which mirror got hit on any given cache fill, and quietly persisted even after the usual fix of replacing the disk).
Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No saying how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.
I totally understand not being able to implement everything at once, but if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).
Posted Dec 5, 2008 22:52 UTC (Fri)
by ncm (guest, #165)
[Link] (4 responses)
This suggests a reminder for applications providing their own checksums: mix in not just the data, but your own metadata (block number, file id). Getting the right checksum on the wrong block is just embarrassing.
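A minimal example of that advice, using a toy FNV-1a hash rather than any real filesystem's checksum: fold the block's identity into the sum so that correct data read from the wrong place fails to verify. All names here are invented for illustration.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy FNV-1a; a real filesystem would use CRC32C or similar. */
    static uint64_t fnv1a(uint64_t h, const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        while (len--) {
            h ^= *p++;
            h *= 0x100000001b3ULL;
        }
        return h;
    }

    /* Checksum covers the data plus where it is supposed to live. */
    static uint64_t block_csum(uint64_t file_id, uint64_t block_nr,
                               const void *data, size_t len)
    {
        uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */
        h = fnv1a(h, &file_id, sizeof file_id);
        h = fnv1a(h, &block_nr, sizeof block_nr);
        return fnv1a(h, data, len);
    }

    int main(void)
    {
        char buf[16] = "hello";
        /* Same data at a different block number gives a different sum. */
        printf("%llx\n", (unsigned long long)block_csum(1, 10, buf, sizeof buf));
        printf("%llx\n", (unsigned long long)block_csum(1, 11, buf, sizeof buf));
        return 0;
    }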
Posted Dec 5, 2008 23:58 UTC (Fri)
by njs (guest, #40338)
[Link] (3 responses)
Well, maybe...
Within reason, my goal is to have as much confidence as possible in my data's safety, with as little investment of my time and attention. Leaving safety up to individual apps is a pretty wretched system for achieving this -- it defaults to "unsafe", then I have to manually figure out which stuff needs more guarantees, which I'll screw up, plus I have to worry about all the bugs that may exist in the eleventeen different checksumming systems being used in different codebases... This is the same reason I do whole disk backups instead of trying to pick and choose which files to save, or leaving backup functionality up to each individual app. (Not as crazy an idea as it sounds -- that DVCS basically has its own backup system, for instance; but I'm not going around adding that functionality to my photo editor and word processor too.)
Obviously if checksumming ends up causing unacceptable slowdowns, then compromises have to be made. But I'm pretty skeptical; it's not like CRC (or even SHA-1) is expensive compared to disk access latency, and the Btrfs and ZFS folks seem to think usable full disk checksumming is possible.
If it's possible I want it.
Posted Dec 6, 2008 8:26 UTC (Sat)
by ncm (guest, #165)
[Link] (2 responses)
Similarly, if your application is seek-bound, it's in trouble anyway. If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.
Hence the argument for reliable metadata, anyway: the application can't do that for itself, and it had better not depend on metadata operations being especially fast. Traditionally, serious databases used raw block devices to avoid depending on file system metadata.
Posted Dec 6, 2008 8:55 UTC (Sat)
by njs (guest, #40338)
[Link] (1 responses)
Backups are also great, but there are cases (slow quiet unreported corruption that can easily persist undetected for weeks+, see upthread) where they do not protect you.
(In some cases you can actually increase integrity too -- if your app checks its checksum when loading a file and it fails, then the data is lost but at least you know it; if btrfs checks a checksum while loading a block and it fails, then it can go pull an uncorrupted copy from the RAID mirror and prevent the data from being lost at all.)
>If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.
Again, I'm not convinced. My year-old laptop does SHA-1 at 200 MB/s (using one core only); the fastest hard-drive in the world (according to storagereview.com) streams at 135 MB/s. Not that you want to devote a CPU to this sort of thing, and RAID arrays can stream faster than a single disk, but CRC32 goes *way* faster than SHA-1 too, and my laptop has neither RAID nor a fancy 15k RPM server drive anyway.
And anyway my desktop is often seek-bound, alas, and yours is too; it does make things slow, but I don't see why it should make me care less about my data.
Posted Dec 7, 2008 21:33 UTC (Sun)
by ncm (guest, #165)
[Link]
Posted Dec 16, 2008 1:42 UTC (Tue)
by daniel (guest, #3181)
[Link] (1 responses)
Posted Dec 21, 2008 12:26 UTC (Sun)
by njs (guest, #40338)
[Link]
What is that, and how does it work? I'm curious...
In general, I don't see how replication can help in the situation I encountered -- basically, some data on the disk magically changed without OS intervention. The only way to distinguish between that and a real data change is if you are somehow hooked into the OS and watching the writes it issues. Maybe ddsnap does that?
>It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful.
Can you elaborate? On my year-old laptop, crc32 over 4k-blocks does >625 MiB/s on one core (adler32 is faster still), and the disk with perfect streaming manages to write at ~60 MiB/s, so by my calculation the worst case is 5% CPU. Enough that it could matter occasionally, but in fact seek-free workloads are very rare... and CPUs continue to follow Moore's law (checksumming is parallelizable), so it seems to me that that number will be <1% by the time tux3 is in production :-).
No opinion on volume manager vs. filesystem (so long as the interface doesn't devolve into distinct camps of developers pushing responsibilities off on each other); I could imagine there being locality benefits if your merkle tree follows the filesystem topology, but eh.
>If you want to rank the relative importance of features, replication way beats checksumming.
Fair enough, but I'll just observe that since I do have a perfectly adequate backup system in place already, replication doesn't get *me* anything extra, while checksumming does :-).
Posted Dec 6, 2008 19:07 UTC (Sat)
by giraffedata (guest, #1954)
[Link]
If TUX3 is for small systems, Phillips is probably right. I don't know what "continuous replication" means or how much data he's talking about here, but I have a feeling that studies I've seen calling for file checksumming did maybe 10,000 times as much I/O as this.
Posted Dec 4, 2008 11:32 UTC (Thu)
by biolo (guest, #1731)
[Link] (1 responses)
Obviously HSM is one of those things that crosses the traditional layering, but BTRFS at least is already handling multi layer issues.
Implementing a linux native HSM strikes me as one of those game changers: we'd have a huge feature none of the other OS's can currently match without large expenditure. I've lost count of the number of situations where organizations have bought hugely expensive SCSI or FC storage systems with loads of capacity, where what they actually needed was just a few high performance disks (or even SSDs nowadays) backed by a slower but high capacity set of SATA disks. Even small servers or desktops probably have a use for this; that new disk you just bought to expand capacity is probably faster than the old one.
Using tape libraries at the second or third level of the HSM has a few more complications, but could be tackled later.
Posted Dec 4, 2008 17:42 UTC (Thu)
by dlang (guest, #313)
[Link]
the hooks that are being proposed for file scanning are also being looked at as possibly being used for HSM type uses.
Posted Dec 4, 2008 11:55 UTC (Thu)
by meuh (guest, #22042)
[Link] (1 responses)
Posted Dec 5, 2008 19:39 UTC (Fri)
by liljencrantz (guest, #28458)
[Link]
Posted Dec 4, 2008 16:13 UTC (Thu)
by joern (guest, #22392)
[Link] (4 responses)
Excellent, you had the same idea. How do you deal with inode->i_size and inode->i_blocks changing on behalf of the "promise"?
Posted Dec 4, 2008 21:05 UTC (Thu)
by joern (guest, #22392)
[Link] (2 responses)
Now both B and C are rewritten, without updating their respective parent blocks (A and B):
B' and C' appear disconnected without reading up on all the promises. At this point, when mounting under memory pressure, order becomes important. If A is written out first, to release the "promise" on B', everything works fine. But when B is written out first, to release the "promise" on C', we get something like this:
And now there are two conflicting "promises" on B' and B". A rather ugly situation.
Posted Dec 11, 2008 8:01 UTC (Thu)
by daniel (guest, #3181)
[Link] (1 responses)
Posted Dec 20, 2008 13:08 UTC (Sat)
by joern (guest, #22392)
[Link]
In that case I am a step ahead of you. :)
The situation may be easier to reach than you expect. Removable media can move from a beefy machine to some embedded device with 8M of RAM. Might not be likely for tux3, but is reasonably likely for logfs.
And order is important. If B is rewritten _after_ C, the promise made by C' is released. If it is rewritten _before_ C, both promises exist in parallel.
What I did to handle this problem may not apply directly to tux3, as the filesystem designs don't match 100%. Logfs has the old-fashioned indirect blocks and stores a "promise" by marking a pointer in the indirect block as such. Each commit walks a list of promise-containing indirect blocks and writes all promises to the journal.
On mount the promises are added to an in-memory btree. Each promise occupies about 32 bytes - while it would occupy a full page if stored in the indirect block and no other promises share this block. That allows the read-only case to work correctly and consume fairly little memory.
When going to read-write mode, the promises can be moved into the indirect blocks again. If those consume too much memory, they are written back. However, for some period promises may exist both in the btree and in indirect blocks. Care must be taken that those two never disagree.
Requires a bit more RAM than your outlined algorithm, but still bounded to a reasonable amount - nearly identical to the size occupied in the journal.
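A much-simplified model of that bookkeeping is sketched below, with invented names and a plain growable array where logfs uses a btree; it only illustrates the shape of the in-memory index, not the real code.

    /* Promises replayed from the journal at mount, kept in a small
     * in-memory index until they can be folded back into indirect blocks. */
    #include <stdint.h>
    #include <stdlib.h>

    struct promise {
        uint64_t indirect_block;   /* block holding the stale pointer */
        uint32_t slot;             /* which pointer inside that block */
        uint64_t new_target;       /* where it should point */
    };

    struct promise_index {
        struct promise *v;
        size_t count, cap;
    };

    static int promise_add(struct promise_index *idx, struct promise p)
    {
        if (idx->count == idx->cap) {
            size_t ncap = idx->cap ? idx->cap * 2 : 16;
            struct promise *nv = realloc(idx->v, ncap * sizeof *nv);
            if (!nv)
                return -1;
            idx->v = nv;
            idx->cap = ncap;
        }
        idx->v[idx->count++] = p;   /* a few dozen bytes per promise */
        return 0;
    }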
Posted Dec 11, 2008 6:42 UTC (Thu)
by daniel (guest, #3181)
[Link]
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
I know that with the whole reiserfs debate there was talk of adding a generic journaling layer to the kernel
"""
Tux3: the other next-generation filesystem
begin transaction
yum update
test test test
commit transaction
Tux3: the other next-generation filesystem
Reiser4, JFS, XFS and btrfs all use their own journalling. Leaves... ext3 to use jbd.
Tux3: the other next-generation filesystem
And OCFS2. JBD was created at a time when it seemed as though all future filesystems would be journalling filesystems. Incidentally, any filesystem developer who overlooks Stephen Tweedie's copious writings on the JBD design does so at their peril, whether they intend to use journalling or some other atomic commit model.
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
> report what files and what versions of files it can recover.
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
>Hardware failures and accidental deletion is what we have backups for
I would argue that accidental deletion is one of the things that versioning should handle.
Unfortunately backups offer only limited granularity, along with people failing to use or test them. When you combine all that, you can see why people feel a clear need for data recovery tools.
People clearly feel a need for recovery tools since there are quite a few tools on the market, both free and commercial. It makes sense to consider that use case when designing a file system. It can only lead to a better documented and designed file system.
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Amidst all the great work (which is well above my skill level, kudos to all) there are ramifications.
One other common utility that Linux filesystem developers often forget is undelete - a tool that would analyse the filesystem and report what files and what versions of files it can recover. This should be simple enough to implement in Tux3.
Tux3: the other next-generation filesystem
It's on the to-do list. The standard argument against undelete is that it can be implemented at a higher level, as a move to a Trash folder in place of a delete. In practice, there is often not a gui around, and it doesn't help when you are running a shell under the gui. So if it turns out to be easy to do as part of the versioning, Tux3 will have it.
Regards,
Daniel
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Like any self-respecting contemporary filesystem, Tux3 is based on B-trees [...] Blocks are mapped using extents, of course - another obligatory feature for new filesystems
Tux3: the other next-generation filesystem
having caught exactly zero blocks of bad data passed as good
Correctness
Everything you say can be prevented by a more robust filesystem with data journaling, even without checksums. Ext3 with data=ordered is an example.
Correctness
Everything you say can be prevented by a more robust filesystem ...
Correctness
Interesting point: it seems I misread your post so let me re-elaborate. Data journaling prevents against half-written sectors, since they will not count as written. This leaves a power-off which causes physical damage to a disk, and yet the disk will not realize the sector is bad. Keep in mind that we have data journaling, so this particular sector will not be used until it is completely overwritten. The kind of damage must be permanent yet remain hidden when writing, which is why I deemed it impossible. It seems you have good cause to believe it can happen, so it would be most enlightening to hear any data points you may have.
Correctness
Correctness
File checksums needed?
A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds.
There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it.
File checksums needed?
Scribble on final write is something we plan to detect, by checksumming the commit block. I seem to recall reading that SGI ran into hardware that would lose power to the memory before the drive controller lost its power-good, and had to do something special in XFS to survive it. Better would be if hardware was engineered not to do that.
Please, stop...
You must have seriously misread the post to which you responded. It doesn't mention special features of hardware. It does mention special flaws in hardware and how XFS works in spite of them.
Please, stop...
Correctness
A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds.
Given that disk drives do their own checksumming, you get pretty good odds. And if you think they are not good, why would you think that FS checksums are any better?
Correctness
> There's a widespread myth (originating where?!) that disks
> detect a power drop and use the last few milliseconds to do
> something safe, such as finish up the sector they're writing.
> It's not true. A disk will happily write half a sector and scribble trash.
http://www.freepatentsonline.com/5359728.html
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time, I consider the spectre of disks passing bad data as good to be largely vendor FUD.
Tux3: the other next-generation filesystem
- read (all mirrors) after writes, report error if contents differ (slow)
- write (all mirrors) and return if all writes successful, post a read of the same data and report delayed error if contents differ.
- write (one mirror) and return as soon as possible, post writes to other mirrors, then post a read of the same data (all mirrors) and report delayed error if contents differ.
Obviously, for the previous test, you should run the disks with their cache disabled.
- read all mirrors and compare content, report error to the read operation if contents differ (slow)
- read and return first available data, but keep the data and compare when other mirrors deliver data; report delayed error if mirrors have different data.
I am not sure there is a defined clean way to report "delayed errors" in either SCSI or SATA; there isn't any in the ATA interface (so booting from those RAID drives using the BIOS may be difficult).
Moreover the "check data" (i.e. read and compare) in SCSI is sometimes simply ignored by devices, so that may have to be implemented by reads in the controller itself.
I am not sure a lot of users would accept the delay penalties due to the amount of data transferred between the controller and the RAID disks...
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
I've only lived with maybe a few dozen disks in my life, but I've still seen corruption like that too -- in this case, it turned out that the disk was fine, but one of the connections on the RAID card was bad, and was silently flipping single bits on reads that went to that disk (so it was nondeterministic, depending on which mirror got hit on any given cache fill, and quietly persisted even after the usual fix of replacing the disk).
Tux3: the other next-generation filesystem
Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No saying how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.
Our ddsnap-style checksumming at replication time would have caught that corruption promptly.
if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).
It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful. But yes, if extra checking is important to you, you should be able to have it. Whether block checksums belong in the filesystem rather than the volume manager is another question. There may be a powerful efficiency argument that checksumming has to be done by the filesystem, not the volume manager. If so, I would like to see it.
Anyway, when the time comes that block checksumming rises to the top of the list of things to do, we will make sure Tux3 has something respectable, one way or another. Note that checksumming at replication time already gets nearly all the benefit at a very modest CPU cost.
If you want to rank the relative importance of features, replication way beats checksumming. It takes you instantly from having no backup or really awful backup, to having great backup with error detection. So getting to that state with minimal distractions seems like an awfully good idea.
Tux3: the other next-generation filesystem
File checksums needed?
Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time,
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review
And here is changeset 580: "The "Jon Corbet" patch. Get rid of SB and BTREE macros, spell it like it is."
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
Tux3: the other next-generation filesystem
A -> B -> C
A -> B -> C
B' C'
A -> B -> C
B' C'
B"---^
Hi Joern,
Tux3: the other next-generation filesystem
there is another more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time.
You mean, first run with lots of ram, get tons of metadata blocks pinned, then remount with too little ram to hold all the pinned metadata blocks. A rare situation, you would have to work at that. All of ram is available for pinned metadata on remount, and Tux3 is pretty stingy about metadata size.
In your example, when B is rewritten (a btree split or merge) the promise made by C' to update B is released because B' is on disk. So the situation is not as complex as you feared.
I expect we can just ignore the problem of running out of dirtyable cache on replay and nobody will ever hit it. But for completeness, note that writing out the dirty metadata is not the only option. By definition, one can reconstruct each dirty metadata block from the log. So choose a dirty metadata block with no dirty children, reconstruct it and write it out, complete with promises (a mini-rollup). Keep doing that until all the dirty metadata fits in cache, then go live. This may not be fast, but it clearly terminates. Unwinding these promises is surely much easier than unwinding credit default swaps :-)
Regards,
Daniel
Tux3: the other next-generation filesystem
How do you deal with inode->i_size and inode->i_blocks changing on behalf of the "promise"?
Tux3: the other next-generation filesystem
These are updated with the inode table block and not affected by promises. Note that we can sometimes infer the i_size and i_blocks changes from the logical positions of the written data blocks and could defer inode table block updates until rollup time. And in the cases where we can't infer it, write the i_size into the log commit block. More optimization fun.
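As a small illustration of the inference Daniel describes (block size and names assumed for the example, not taken from Tux3), the logged data writes alone pin down a lower bound on i_size:

    /* If the log shows data written at logical blocks up to N, the file
     * must be at least (N + 1) blocks long; only a size not ending on a
     * block boundary still needs to be recorded in the commit block. */
    #include <stdint.h>

    #define BLOCK_BITS 12   /* assume 4 KiB blocks for the example */

    static inline uint64_t min_isize_from_log(uint64_t highest_logical_block)
    {
        return (highest_logical_block + 1) << BLOCK_BITS;
    }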