Hinted Handoff - System Design
Hinted Handoff
High Availability Architecture
NK included in Fundamentals
2023-08-12 2078 words 10 minutes
The target audience for this article falls into the following roles:
Tech workers
Students
Engineering managers
The prerequisite to reading this article is fundamental knowledge of system design
components. This article does not cover an in-depth guide on individual system
design components.
Disclaimer: The system design questions are subjective. This article is written based
on the research I have done on the topic and might differ from real-world
implementations. Feel free to share your feedback and ask questions in the comments.
Some of the linked resources are affiliates. As an Amazon Associate, I earn from
qualifying purchases.
I would never voluntarily build the hinted handoff model of consistency repair
again.
- Ryan Betts, InfluxData
# Terminology
The following terminology might be helpful for you:
Node: a server that provides functionality to other services
Coordinator Node: a node that determines the target node to handle the request
in the cluster
Amazon Dynamo: a highly available distributed key-value data store
https://systemdesign.one/hinted-handoff/ 1/8
29/04/2024, 12:38 Hinted Handoff - System Design
Apache Cassandra: a distributed, wide-column data store
Apache Kafka: a distributed event store and stream-processing platform
# Requirements
Design a distributed system pattern with the following characteristics:
| Functional Requirements
tolerate temporary node failures
simple to implement
| Non-Functional Requirements
high write availability
eventually consistent
scalable
# What Is High Availability Architecture?
Distributed systems are gaining popularity due to inherent increased fault tolerance
and the ability to eliminate single points of failure. A distributed system replicates the
data to provide high availability at the expense of reduced consistency 1.
Highly available distributed data stores such as Amazon Dynamo and Apache
Cassandra implement the eventual consistency model 2, 3.
The systems with high-availability architecture must be able to maintain optimal
performance even during peak loads. High availability is typically measured as the
percentage of time a service remains available to the clients 4.
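As a rough illustration of that measurement, the sketch below computes availability as uptime divided by total observed time and converts an availability target into a yearly downtime budget. The function names and the 365-day year are assumptions made for this example.

```python
def availability_percent(uptime_seconds: float, downtime_seconds: float) -> float:
    """Percentage of the observed time during which the service was available."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_seconds / total


def yearly_downtime_budget_minutes(target_percent: float) -> float:
    """Minutes of downtime per 365-day year permitted by an availability target."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1.0 - target_percent / 100.0)
```

For instance, a target of 99.9 percent leaves a budget of roughly 525.6 minutes of downtime per year under these assumptions.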
# Real-World Analogy of Hinted Handoff
Without the hinted handoff approach, either the sender must wait until you return to
your desk or you will miss the message.
The storage location of hints depends on the system implementation. For instance,
Apache Cassandra stores the hints in the backup node for a certain time frame 3. The
backup node flushes the hints to disk-based storage every few seconds. Alternatively,
hints can be stored in the local directory of each node to improve the replay
performance 6.
The backup node will reject hints if the target node remains unavailable for an
extended period. The backup node should remove the hints when the target node
gets decommissioned. The hints for dropped tables must also be removed 7.
The backup node must track the number of hints that are written concurrently. The
number of hints stored in the backup nodes increases when a significant number of
target nodes become unavailable. There is a risk that a large number of hints
degrades the performance of the backup node, resulting in write rejections or an error
response being thrown 6.
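The rules above can be sketched as a small in-memory hint store kept by a backup node. This is a minimal sketch, not how any particular database implements it: the class name `HintStore`, its methods, the default limits, and the shape of a mutation (a dict with a `table` key) are all assumptions made for illustration.

```python
import time
from collections import defaultdict


class HintStore:
    """Toy in-memory hint store kept by a backup node (hypothetical API)."""

    def __init__(self, max_window_seconds=3 * 60 * 60, max_pending_hints=1000):
        self.max_window_seconds = max_window_seconds  # assumed cutoff for collecting hints
        self.max_pending_hints = max_pending_hints    # assumed capacity limit
        self.hints = defaultdict(list)                # target node -> list of mutations
        self.down_since = {}                          # target node -> time it went down

    def pending(self):
        """Total number of hints currently held for all targets."""
        return sum(len(h) for h in self.hints.values())

    def mark_down(self, target, now=None):
        """Remember when a target was first seen as unavailable."""
        self.down_since.setdefault(target, time.time() if now is None else now)

    def add_hint(self, target, mutation, now=None):
        """Return True if the hint was stored, False if it was rejected."""
        now = time.time() if now is None else now
        down_at = self.down_since.setdefault(target, now)
        if now - down_at > self.max_window_seconds:
            return False  # target has been unavailable for too long
        if self.pending() >= self.max_pending_hints:
            return False  # protect the backup node from overload
        self.hints[target].append(mutation)
        return True

    def decommission(self, target):
        """Drop all hints for a node that permanently left the cluster."""
        self.hints.pop(target, None)
        self.down_since.pop(target, None)

    def drop_table(self, table):
        """Drop hints whose mutation belongs to a dropped table."""
        for target in list(self.hints):
            self.hints[target] = [m for m in self.hints[target] if m["table"] != table]
```

Rejecting a hint here simply returns `False`; whether the client then sees an error depends on the system's requirements.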
The following factors impact the lifecycle of the hints 9:
hint window: time frame allowed to collect hints
garbage collection grace time: expiration time of hints
time-to-live (TTL): validity of data mutations
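One way to read these three factors is as a pair of checks: one made when a hint is about to be collected, and one made before it is replayed. The sketch below is a simplified interpretation of the list above rather than the exact rules of Cassandra or any other system, and the function and parameter names are hypothetical.

```python
from typing import Optional


def should_collect_hint(target_down_seconds: float, hint_window_seconds: float) -> bool:
    """Collect hints only while the target has been down no longer than the hint window."""
    return target_down_seconds <= hint_window_seconds


def hint_is_replayable(
    hint_age_seconds: float,
    gc_grace_seconds: float,
    mutation_ttl_seconds: Optional[float] = None,
) -> bool:
    """A hint is worth replaying only if neither the hint nor its data has expired."""
    if hint_age_seconds >= gc_grace_seconds:
        return False  # the hint outlived the garbage collection grace time
    if mutation_ttl_seconds is not None and hint_age_seconds >= mutation_ttl_seconds:
        return False  # the mutation itself is no longer valid
    return True
```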
The requirements and the system-specific implementation should determine whether a
failure is shown to the client when the target node is temporarily unavailable and
hinted handoff is executed. On top of that, the health patterns of hinted handoff are
also debatable: it is difficult to determine how frequently hinted handoff may run
before the system should no longer be declared healthy 1.
It is also common to draw parallels between hinted handoff and write-ahead logging
(WAL) replication. WAL replication is relatively simple but in essence shares similar
drawbacks with the hinted handoff pattern 11.
An alternative approach to implementing high availability architecture is to use a
shared-nothing approach by deploying a log-based journaling service such as Apache
Kafka. With this method, the data is ingested into the durable journal before being
written into the database 11, 10.
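The journal-first idea can be sketched with an in-memory stand-in for the durable log. The code below does not use a Kafka client; `Journal`, `Replica`, and `catch_up` are hypothetical names, and a real journal would persist entries to disk. It only shows that each replica tracks its own offset and catches up at its own pace, so no other node has to hold hints on its behalf.

```python
class Journal:
    """In-memory stand-in for a durable, append-only log (not a Kafka client)."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        """Append an entry and return its offset in the log."""
        self._entries.append(entry)
        return len(self._entries) - 1

    def read_from(self, offset):
        """Return all entries at or after the given offset."""
        return self._entries[offset:]


class Replica:
    """A database replica that applies journal entries in order."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.next_offset = 0   # first journal offset not yet applied
        self.available = True

    def catch_up(self, journal):
        """Apply every entry the replica has not seen; return how many were applied."""
        if not self.available:
            return 0
        entries = journal.read_from(self.next_offset)
        for key, value in entries:
            self.data[key] = value
        self.next_offset += len(entries)
        return len(entries)
```

A replica that was unavailable simply resumes from its last offset when it returns.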
Earlier releases of Cassandra used to store hints in the hints table. However, the latest
release of Cassandra utilizes flat files on disk to store hints 9. Cassandra persists the
hints for a particular database replica under a single partition key. Therefore,
Cassandra can replay hints via a sequential read operation with very little impact on
performance 7, 3.
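The benefit of grouping hints per replica can be illustrated with one append-only file per target node. This is a minimal sketch under assumed conventions: the `.hints` file naming, the JSON-lines encoding, and the function names are invented for the example and do not reflect Cassandra's on-disk format.

```python
import json
import os
import tempfile


def append_hint(hints_dir, target_id, mutation):
    """Append one mutation to the flat file that belongs to the target replica."""
    path = os.path.join(hints_dir, f"{target_id}.hints")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(mutation) + "\n")


def replay_hints(hints_dir, target_id, apply):
    """Read the target's file front to back, apply each hint, then delete the file."""
    path = os.path.join(hints_dir, f"{target_id}.hints")
    if not os.path.exists(path):
        return 0
    replayed = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # a single sequential pass over the file
            apply(json.loads(line))
            replayed += 1
    os.remove(path)
    return replayed


def demo():
    """Store two hints for a hypothetical replica and replay them into a dict."""
    target = {}
    with tempfile.TemporaryDirectory() as hints_dir:
        append_hint(hints_dir, "replica-2", {"key": "a", "value": 1})
        append_hint(hints_dir, "replica-2", {"key": "b", "value": 2})
        count = replay_hints(
            hints_dir, "replica-2", lambda m: target.__setitem__(m["key"], m["value"])
        )
    return count, target
```

Because all hints for one replica sit together, replay never has to scan hints that belong to other nodes.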
Even though hinted handoff allows Cassandra to execute the same number of write
operations when the cluster is operating at reduced capacity, write failures should
still be permitted to preserve reliability and performance 7, 3.
# Hinted Handoff Advantages
The benefits of the hinted handoff pattern can be summed up as follows 7, 8:
reduced read repairs and improved read performance
extremely high write availability
improved consistency after temporary outages such as network faults
improved fault tolerance
increased durability on temporary failures of target nodes via redirection of
writes to backup nodes
reduced latency by routing writes to healthy backup nodes
improved scalability through redirection of traffic to healthy backup nodes
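The redirection of writes mentioned in the list above can be sketched as follows. This is a toy model with hypothetical names (`Node`, `coordinator_write`, `replay`), and it ignores quorums, timeouts, and hint expiry. It only shows a coordinator storing a hint on a backup node when a target is down and the backup handing the write over later.

```python
class Node:
    """A cluster node that stores data and may hold hints for other nodes."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target_name, key, value) tuples held for other nodes


def coordinator_write(replicas, backup, key, value):
    """Write to every replica; store a hint on the backup for each one that is down."""
    acked = 0
    for node in replicas:
        if node.up:
            node.data[key] = value
            acked += 1
        else:
            backup.hints.append((node.name, key, value))
    return acked


def replay(backup, target):
    """Hand the stored hints over to the target once it is reachable again."""
    if not target.up:
        return 0
    mine = [h for h in backup.hints if h[0] == target.name]
    for _, key, value in mine:
        target.data[key] = value
    backup.hints = [h for h in backup.hints if h[0] != target.name]
    return len(mine)
```

Until `replay` runs, a read served by the recovered target would still return stale data.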
# Hinted Handoff Disadvantages
The drawbacks of the hinted handoff pattern can be summarized as follows 7, 8, 10,
11, 12:
reduced durability when hardware faults occur on backup nodes
increased system complexity
increased storage requirements due to the need for additional metadata
stale reads until hints are replayed
increased bandwidth usage due to data redirection
increased input-output (I/O) load on the backup nodes if numerous target nodes
become unavailable
hint activity is a noisy signal rather than an actionable metric because temporary failures are common
increased operational complexity on non-uniform workloads
potential thundering herd problem when the backup node tries to quickly replay
hints on a newly returned target node
The hinted handoff is a suboptimal architecture for load-shedding because the backup
nodes must tolerate additional load and journal the writes on behalf of the unavailable
target nodes. This approach will eventually result in degraded system performance. A
workaround to reduce the load on backup nodes is to deploy dedicated storage
partitions for storing hints 11, 10.
There are also cases when hinted handoff can be a precursor to a serious cluster
failure that might affect data durability. However, it is difficult for a human
operator to tell at an early stage whether action should be taken 11, 10.
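The thundering herd drawback listed above arises when hints are replayed as fast as possible. One possible mitigation, which is an assumption of this example and not something the cited sources prescribe, is to replay hints in small batches with a pause between them. The function name, batch size, and pause length below are hypothetical.

```python
import time


def replay_throttled(hints, apply, batch_size=100, pause_seconds=0.1, sleep=time.sleep):
    """Replay hints in batches, pausing between batches to limit load on the target."""
    replayed = 0
    for start in range(0, len(hints), batch_size):
        for mutation in hints[start:start + batch_size]:
            apply(mutation)
            replayed += 1
        if start + batch_size < len(hints):
            sleep(pause_seconds)  # give the newly returned node time to recover
    return replayed
```

The `sleep` parameter is injectable only so the pacing can be observed without actually waiting.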
# Summary
The hinted handoff is a sophisticated approach to attain improved reliability and
resiliency in the eventual consistency model. The hinted handoff pattern is
implemented by distributed databases such as Amazon Dynamo and Apache
Cassandra 2, 3. As always, every software architecture pattern comes with a trade-off.
# License
CC BY-NC-ND 4.0: This license allows reusers to copy and distribute the content in
this article in any medium or format in unadapted form only, for noncommercial
purposes, and only so long as attribution is given to the creator. The original article
must be backlinked.
# References
1. Katy Farmer, Eventual Consistency: The Hinted Handoff Queue (2018), influxdata.com
2. Giuseppe DeCandia, et al., Dynamo: Amazon’s Highly Available Key-value Store (2007), allthingsdistributed.com
3. Avinash Lakshman, Prashant Malik, Cassandra - A Decentralized Structured Storage System (2009), cs.cornell.edu
4. John Noonan, High Availability Architecture Demystified (2022), redis.com
5. varunu28, Sloppy Quorum and Hinted handoff: Quorum in the times of failure (2022), distributed-computing-musings.com
6. Hinted handoff: repair during write path, docs.datastax.com
7. Jonathan Ellis, Modern hinted handoff (2012), datastax.com
8. Tamerlan Gudabayev, The Design Patterns for Distributed Systems Handbook (2023), freecodecamp.org
9. Radovan Zvoncek, Hinted Handoff and GC Grace Demystified (2018), thelastpickle.com
10. Ryan Betts, Lessons and Observations Scaling a Time Series Database (2018), InfluxData
11. Colin Breck, Shared-Nothing Architectures for Server Replication and Synchronization (2019), blog.colinbreck.com
12. varunu28, Paper Notes: Dynamo - Amazon’s Highly Available Key-value Store (2022), distributed-computing-musings.com