store: implement repair process #11
Comments
Wouldn't a Merkle tree-like system be a better alternative?
Hmm, yes, potentially! I'd have to do a bit more research, but it seems like each store node could keep a Merkle tree of the segments it knows about, and that could be done without coördination.
You could repurpose the same hashes to detect data corruption at the segment level, too. I have built a similar system at work to detect divergence between different systems.
I think hashing the ULIDs alone, without fetching the data, would be enough, since they're unique.
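To make the idea concrete, here is a minimal sketch (not part of OK Log) of a per-node Merkle root computed over segment ULIDs alone: two nodes whose roots differ know their segment sets have diverged and can then compare further. The hashing scheme and helper names are illustrative assumptions, not the project's API.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// merkleRoot computes a Merkle root over a store node's segment ULIDs.
// Hashing the ULIDs alone is enough to detect divergence between nodes,
// since ULIDs uniquely identify segments.
func merkleRoot(ulids []string) [sha256.Size]byte {
	sort.Strings(ulids) // canonical order, so every node computes the same root

	// Leaf level: one hash per ULID.
	level := make([][sha256.Size]byte, 0, len(ulids))
	for _, id := range ulids {
		level = append(level, sha256.Sum256([]byte(id)))
	}
	if len(level) == 0 {
		return sha256.Sum256(nil) // root of an empty segment set
	}

	// Combine pairwise until a single root remains.
	for len(level) > 1 {
		var next [][sha256.Size]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i]) // odd node carries up unchanged
				continue
			}
			pair := append(level[i][:], level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		level = next
	}
	return level[0]
}

func main() {
	a := merkleRoot([]string{"01ARZ3NDEKTSV4RRFFQ69G5FAV", "01BX5ZZKBKACTAV9WEVGEMMVRZ"})
	b := merkleRoot([]string{"01ARZ3NDEKTSV4RRFFQ69G5FAV"})
	fmt.Println(a == b) // false: the two segment sets have diverged
}
```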
@peterbourgon I've spent the last hour or so reading about oklog and what you've been working on -- really excited to find it! I was just hoping to clarify the implications of this pending enhancement. In the case that a store node is lost, is there currently no (simple) way to restore the cluster to a state where a node can be lost again without permanently losing segments?
@untoldone If you're running a 3-node cluster with a replication factor of 2, and a single node dies and is (let's say) replaced by a new, empty node, then ca. 33% of your log data will exist only on a single node, and is vulnerable to permanent loss if that node dies. (Observe that this will continue to be true until you've run in this mode for your log retention period, after which you'll be back to fully replicated.)

Pretend we're in this situation, and someone performs a query which returns log data that only exists on a single node. It's possible for the query subsystem to detect that it received only a single copy of each of those logs (instead of the expected 2) and then trigger a so-called read repair, where it takes the logs it received and writes them to a node that doesn't already have them. This enhancement is about implementing that process, but rather than doing it reactively in response to user queries, doing it proactively by making regular, continuous queries over the entire log time range, in effect generating fake read load on the cluster. (In that sense, this ticket depends on #6.)

To answer your question, there are several ways to avoid this situation. One, perform regular backups of node disks, and restore them if/when a node dies; OK Log handles this scenario gracefully, without operator intervention. Two, run a larger cluster with a higher replication factor, to tolerate more node loss.

Hope this helps.
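As a rough illustration of the detection half of read repair, here is a sketch under assumed, simplified types (not the project's actual query API): the query layer counts how many nodes returned each record and flags anything below the replication factor for re-writing.

```go
package main

import "fmt"

// queryResult is a simplified stand-in for a query response: for each
// record ULID, the store nodes that returned a copy of it.
type queryResult map[string][]string

// underReplicated returns the records seen on fewer nodes than the
// replication factor; a read-repair step would re-write these to nodes
// that are missing a copy.
func underReplicated(res queryResult, factor int) map[string][]string {
	out := map[string][]string{}
	for ulid, nodes := range res {
		if len(nodes) < factor {
			out[ulid] = nodes
		}
	}
	return out
}

func main() {
	res := queryResult{
		"01ARZ3NDEKTSV4RRFFQ69G5FAV": {"store-1", "store-2"}, // fully replicated
		"01BX5ZZKBKACTAV9WEVGEMMVRZ": {"store-3"},            // only one copy left
	}
	for ulid, nodes := range underReplicated(res, 2) {
		fmt.Printf("repair %s: only found on %v\n", ulid, nodes)
	}
}
```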
Thanks! This is really helpful and I really appreciate the quick reply. Sorry if this is off-topic for this issue, but as a follow-up to your response: hypothetically, in a cluster of 3 storage nodes, would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3), and, in the case of node failure, copy all data files back to a new node before re-joining the cluster? Would I want to back up both the store and ingest directories? Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files)? I didn't notice anything in the design that suggested this wouldn't be OK, but thought I'd ask as I didn't see any explicit comments around this type of topic.
> would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3)

Yes.

> and, in the case of node failure, copy all data files back to a new node before re-joining the cluster?

Yes, also safe.

> Would I want to back up both the store and ingest directories?

Yes, also safe.

> Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files)?

Nope, there are no index files or anything. If you start up a new ingest or store node with some restored data in the relevant data dir, the node should pick up and manage everything transparently.
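For the restore path, copying the backed-up directory back into place before starting the replacement node is all that's needed; in practice a tool like rsync or aws s3 sync would do this, but here is a minimal Go sketch with hypothetical paths (/backup/oklog/data, /var/oklog/data):

```go
package main

import (
	"io"
	"os"
	"path/filepath"
)

// copyTree copies every regular file under src into the same relative
// location under dst — enough to snapshot or restore a store/ingest data
// directory while the node is stopped.
func copyTree(src, dst string) error {
	return filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if info.IsDir() {
			return os.MkdirAll(target, info.Mode())
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(target)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}

func main() {
	// Restore a previously backed-up data dir before starting the new node.
	if err := copyTree("/backup/oklog/data", "/var/oklog/data"); err != nil {
		panic(err)
	}
}
```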
Awesome, thanks!
The repair process should walk the complete data set (the complete timespace) and essentially perform read repair. When it completes, we should have guaranteed that all records are at the desired replication factor. (There may be smarter ways to do this.)
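A minimal sketch of what that walk could look like (the window size, retention value, and repairWindow helper are assumptions for illustration, not OK Log internals): cover the retention period in fixed time windows and run the query-plus-repair step for each.

```go
package main

import (
	"fmt"
	"time"
)

// repairWindow stands in for "query this time range across the store nodes
// and read-repair any under-replicated records found".
func repairWindow(from, to time.Time) error {
	fmt.Printf("repairing %s .. %s\n", from.Format(time.RFC3339), to.Format(time.RFC3339))
	return nil
}

// walkTimespace covers the whole retention period in fixed-size windows,
// generating the continuous "fake read load" described above.
func walkTimespace(retention, window time.Duration) error {
	end := time.Now()
	for from := end.Add(-retention); from.Before(end); from = from.Add(window) {
		to := from.Add(window)
		if to.After(end) {
			to = end
		}
		if err := repairWindow(from, to); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// e.g. a 7-day retention, walked in 1-hour windows
	if err := walkTimespace(7*24*time.Hour, time.Hour); err != nil {
		fmt.Println("repair pass failed:", err)
	}
}
```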