store: implement repair process #11
Comments
Wouldn't a Merkle tree-like system be a better alternative?
Hmm, yes, potentially! I'd have to do a bit more research, but it seems like each store node could keep a Merkle tree of the segments it knows about, and that could be done without coördination.
You could repurpose the same hashes to detect data corruption at the segment level, too. I have built a similar system at work to detect divergence between different systems.
I think hashing the ULIDs alone, without fetching the data, would be enough, since they're unique.
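To make the idea concrete, here is a minimal sketch (not part of OK Log) of a per-node Merkle root computed over segment ULIDs alone: two nodes whose roots differ know their segment sets have diverged and can then compare further. The hashing scheme and helper names are illustrative assumptions, not the project's API.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// merkleRoot computes a Merkle root over a store node's segment ULIDs.
// Hashing the ULIDs alone is enough to detect divergence between nodes,
// since ULIDs uniquely identify segments.
func merkleRoot(ulids []string) [sha256.Size]byte {
	sort.Strings(ulids) // canonical order, so every node computes the same root

	// Leaf level: one hash per ULID.
	level := make([][sha256.Size]byte, 0, len(ulids))
	for _, id := range ulids {
		level = append(level, sha256.Sum256([]byte(id)))
	}
	if len(level) == 0 {
		return sha256.Sum256(nil) // root of an empty segment set
	}

	// Combine pairwise until a single root remains.
	for len(level) > 1 {
		var next [][sha256.Size]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i]) // odd node carries up unchanged
				continue
			}
			pair := append(level[i][:], level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		level = next
	}
	return level[0]
}

func main() {
	a := merkleRoot([]string{"01ARZ3NDEKTSV4RRFFQ69G5FAV", "01BX5ZZKBKACTAV9WEVGEMMVRZ"})
	b := merkleRoot([]string{"01ARZ3NDEKTSV4RRFFQ69G5FAV"})
	fmt.Println(a == b) // false: the two segment sets have diverged
}
```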
@peterbourgon I've spent the last hour or so reading about oklog and what you've been working on -- really excited to find it! I was just hoping to clarify the implications of this pending enhancement. In the case that a store node is lost, is there currently no (simple) way to restore the cluster to a state where a node can be lost again without permanently losing segments?
@untoldone If you're running a 3-node cluster with a replication factor of 2, and a single node dies and is (let's say) replaced by a new, empty node, then ca. 33% of your log data will exist only on a single node, and is vulnerable to permanent loss if that node dies. (Observe that this will continue to be true until you've run in this mode for your log retention period, after which you'll be back to fully replicated.)

Pretend we're in this situation, and someone performs a query which returns log data that only exists on a single node. It's possible for the query subsystem to detect that it received only a single copy of each of those logs (instead of the expected 2) and then trigger a so-called read repair, where it takes the logs it received and writes them to a node that doesn't already have them. This enhancement is about implementing that process, but rather than doing it reactively in response to user queries, doing it proactively by making regular, continuous queries over the entire log time range, in effect generating fake read load on the cluster. (In that sense, this ticket depends on #6.)

To answer your question, there are several ways to avoid this situation. One, perform regular backups of node disks, and restore them if/when a node dies; OK Log handles this scenario gracefully, without operator intervention. Two, run a larger cluster with a higher replication factor, to tolerate more node loss.

Hope this helps.
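As a rough illustration of the detection half of read repair, here is a sketch under assumed, simplified types (not the project's actual query API): the query layer counts how many nodes returned each record and flags anything below the replication factor for re-writing.

```go
package main

import "fmt"

// queryResult is a simplified stand-in for a query response: for each
// record ULID, the store nodes that returned a copy of it.
type queryResult map[string][]string

// underReplicated returns the records seen on fewer nodes than the
// replication factor; a read-repair step would re-write these to nodes
// that are missing a copy.
func underReplicated(res queryResult, factor int) map[string][]string {
	out := map[string][]string{}
	for ulid, nodes := range res {
		if len(nodes) < factor {
			out[ulid] = nodes
		}
	}
	return out
}

func main() {
	res := queryResult{
		"01ARZ3NDEKTSV4RRFFQ69G5FAV": {"store-1", "store-2"}, // fully replicated
		"01BX5ZZKBKACTAV9WEVGEMMVRZ": {"store-3"},            // only one copy left
	}
	for ulid, nodes := range underReplicated(res, 2) {
		fmt.Printf("repair %s: only found on %v\n", ulid, nodes)
	}
}
```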
Thanks! This is really helpful and I really appreciate the quick reply. Sorry if this is off-topic for this issue, but as a follow-up to your response: hypothetically, in a cluster of 3 storage nodes, would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3), and, in the case of node failure, copy all data files back to a new node before re-joining the cluster? Would I want to back up both the store and ingest directories? Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files)? I didn't notice anything in the design that suggested this wouldn't be OK, but thought I'd ask as I didn't see any explicit comments around this type of topic.
> would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3)

Yes.

> and, in the case of node failure, copy all data files back to a new node before re-joining the cluster?

Yes, also safe.

> Would I want to back up both the store and ingest directories?

Yes, also safe.

> Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files)?

Nope, there are no index files or anything. If you start up a new ingest or store node with some restored data in the relevant data dir, the node should pick up and manage everything transparently.
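For the restore path, copying the backed-up directory back into place before starting the replacement node is all that's needed; in practice a tool like rsync or aws s3 sync would do this, but here is a minimal Go sketch with hypothetical paths (/backup/oklog/data, /var/oklog/data):

```go
package main

import (
	"io"
	"os"
	"path/filepath"
)

// copyTree copies every regular file under src into the same relative
// location under dst — enough to snapshot or restore a store/ingest data
// directory while the node is stopped.
func copyTree(src, dst string) error {
	return filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if info.IsDir() {
			return os.MkdirAll(target, info.Mode())
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(target)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}

func main() {
	// Restore a previously backed-up data dir before starting the new node.
	if err := copyTree("/backup/oklog/data", "/var/oklog/data"); err != nil {
		panic(err)
	}
}
```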
Awesome, thanks!
The repair process should walk the complete data set (the complete timespace) and essentially perform read repair. When it completes, we should have guaranteed that all records are at the desired replication factor. (There may be smarter ways to do this.)
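A minimal sketch of what that walk could look like (the window size, retention value, and repairWindow helper are assumptions for illustration, not OK Log internals): cover the retention period in fixed time windows and run the query-plus-repair step for each.

```go
package main

import (
	"fmt"
	"time"
)

// repairWindow stands in for "query this time range across the store nodes
// and read-repair any under-replicated records found".
func repairWindow(from, to time.Time) error {
	fmt.Printf("repairing %s .. %s\n", from.Format(time.RFC3339), to.Format(time.RFC3339))
	return nil
}

// walkTimespace covers the whole retention period in fixed-size windows,
// generating the continuous "fake read load" described above.
func walkTimespace(retention, window time.Duration) error {
	end := time.Now()
	for from := end.Add(-retention); from.Before(end); from = from.Add(window) {
		to := from.Add(window)
		if to.After(end) {
			to = end
		}
		if err := repairWindow(from, to); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// e.g. a 7-day retention, walked in 1-hour windows
	if err := walkTimespace(7*24*time.Hour, time.Hour); err != nil {
		fmt.Println("repair pass failed:", err)
	}
}
```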