Consistent Hashing
It would be nice if, when a cache machine was added, it took its fair
share of objects from all the other cache machines. Equally, when a
cache machine was removed, it would be nice if its objects were
shared between the remaining machines. This is exactly what
consistent hashing does - consistently maps objects to the same cache
machine, as far as is possible, at least.
The basic idea behind the consistent hashing algorithm is to hash both
objects and caches using the same hash function. The reason to do
this is to map the cache to an interval, which will contain a number of
object hashes. If the cache is removed then its interval is taken over
by a cache with an adjacent interval. All the other caches remain
unchanged.
Demonstration
Let's look at this in more detail. The hash function actually maps
objects and caches to a number range. This should be familiar to
every Java programmer - the hashCode method on Object returns an int,
which lies in the range -2^31 to 2^31-1. Imagine mapping this range into a
circle so the values wrap around. Here's a picture of the circle with a
number of objects (1, 2, 3, 4) and caches (A, B, C) marked at the
points that they hash to (based on a diagram from Web Caching with
Consistent Hashing by David Karger et al):
To find which cache an object goes in, we move clockwise round the
circle until we find a cache point. So in the diagram above, we see
objects 1 and 4 belong in cache A, object 2 belongs in cache B and
object 3 belongs in cache C. Consider what happens if cache C is
removed: object 3 now belongs in cache A, and all the other object
mappings are unchanged. If then another cache D is added in the
position marked it will take objects 3 and 4, leaving only object 1
belonging to A.
This works well, except the size of the intervals assigned to each
cache is pretty hit and miss. Since it is essentially random it is
possible to have a very non-uniform distribution of objects between
caches. The solution to this problem is to introduce the idea of "virtual
nodes", which are replicas of cache points in the circle. So whenever
we add a cache we create a number of points in the circle for it.
You can see the effect of this in the following plot which I produced by
simulating storing 10,000 objects in 10 caches using the code
described below. On the x-axis is the number of replicas of cache
points (with a logarithmic scale). When it is small, we see that the
distribution of objects across caches is unbalanced, since the standard
deviation as a percentage of the mean number of objects per cache
(on the y-axis, also logarithmic) is high. As the number of replicas
increases the distribution of objects becomes more balanced. This
experiment shows that a figure of one or two hundred replicas
achieves an acceptable balance (a standard deviation that is roughly
between 5% and 10% of the mean).
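For reference, a simulation along these lines can be sketched in a few lines of Java (the key names, the replica counts tried, and the MD5-based hash below are arbitrary illustrative choices, not the exact code behind the plot):

import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class BalanceSimulation {

  // Hash a string to an int by folding the first four bytes of its MD5 digest.
  static int hash(String key) throws Exception {
    byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
    return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
  }

  public static void main(String[] args) throws Exception {
    int numCaches = 10, numObjects = 10000;
    for (int replicas : new int[] { 1, 10, 100, 1000 }) {
      // Build the circle: each cache contributes 'replicas' points.
      SortedMap<Integer, Integer> circle = new TreeMap<Integer, Integer>();
      for (int c = 0; c < numCaches; c++)
        for (int r = 0; r < replicas; r++)
          circle.put(hash("cache-" + c + "-" + r), c);
      // Assign each object to the first cache point clockwise from its hash.
      int[] counts = new int[numCaches];
      for (int o = 0; o < numObjects; o++) {
        SortedMap<Integer, Integer> tail = circle.tailMap(hash("object-" + o));
        counts[circle.get(tail.isEmpty() ? circle.firstKey() : tail.firstKey())]++;
      }
      // Report the standard deviation as a percentage of the mean objects per cache.
      double mean = (double) numObjects / numCaches, sumSq = 0;
      for (int count : counts) sumSq += (count - mean) * (count - mean);
      System.out.printf("replicas=%4d  stddev/mean=%.1f%%%n", replicas,
          100 * Math.sqrt(sumSq / numCaches) / mean);
    }
  }
}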
Implementation
For completeness here is a simple implementation in Java. In order for
consistent hashing to be effective it is important to have a hash
function that mixes well. Most implementations
of Object's hashCode do not mix well - for example, they typically produce
a restricted number of small integer values - so we have
a HashFunction interface to allow a custom hash function to be used.
MD5 hashes are recommended here.
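The HashFunction interface itself is not shown in the listing; a minimal version, together with an MD5-backed implementation of the sort recommended above, might look like this (the interface's exact shape and the MD5HashFunction class are my own illustration):

import java.security.MessageDigest;

public interface HashFunction {
  int hash(Object key);
}

// One possible MD5-backed HashFunction: hash the key's string form and fold
// the first four bytes of the digest into an int.
class MD5HashFunction implements HashFunction {
  public int hash(Object key) {
    try {
      byte[] d = MessageDigest.getInstance("MD5").digest(key.toString().getBytes("UTF-8"));
      return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}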
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {
  private final HashFunction hashFunction;
  private final int numberOfReplicas;
  private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

  public ConsistentHash(HashFunction hashFunction, int numberOfReplicas, Collection<T> nodes) {
    this.hashFunction = hashFunction;
    this.numberOfReplicas = numberOfReplicas;
    for (T node : nodes) add(node);
  }
The circle is represented as a sorted map of integers, which represent
the hash values, to caches (of type T here).
When a ConsistentHash object is created each node is added to the circle
map a number of times (controlled by numberOfReplicas). The location of
each replica is chosen by hashing the node's name along with a
numerical suffix, and the node is stored at each of these points in the
map.
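Following that description, the add and remove methods could look like this (a sketch continuing the ConsistentHash class above, using the node's toString value as its name):

  public void add(T node) {
    for (int i = 0; i < numberOfReplicas; i++) {
      // Each replica is placed by hashing the node's name plus a numerical suffix.
      circle.put(hashFunction.hash(node.toString() + i), node);
    }
  }

  public void remove(T node) {
    for (int i = 0; i < numberOfReplicas; i++) {
      circle.remove(hashFunction.hash(node.toString() + i));
    }
  }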
To find a node for an object (the get method), the hash value of the
object is used to look in the map. Most of the time there will not be a
node stored at this hash value (since the hash value space is typically
much larger than the number of nodes, even with replicas), so the
next node is found by looking for the first key in the tail map. If the
tail map is empty then we wrap around the circle by getting the first
key in the circle.
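Putting that into code, the get method (which completes the class) might be:

  public T get(Object key) {
    if (circle.isEmpty()) {
      return null;
    }
    int hash = hashFunction.hash(key);
    if (!circle.containsKey(hash)) {
      // Walk clockwise: take the first cache point at or after this hash,
      // wrapping around to the start of the circle if necessary.
      SortedMap<Integer, T> tailMap = circle.tailMap(hash);
      hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
    }
    return circle.get(hash);
  }
}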
Usage
So how can you use consistent hashing? You are most likely to meet it
in a library, rather than having to code it yourself. For example, as
mentioned above, memcached, a distributed memory object caching
system, now has clients that support consistent hashing.
Last.fm's ketama by Richard Jones was the first, and there is now
a Java implementation by Dustin Sallings (which inspired my
simplified demonstration implementation above). It is interesting to
note that it is only the client that needs to implement the consistent
hashing algorithm - the memcached server is unchanged. Other
systems that employ consistent hashing include Chord, which is a
distributed hash table implementation, and Amazon's Dynamo, which
is a key-value store (not available outside Amazon).
Posted by Tom White at 17:26
Labels: Distributed Systems, Hashing
14 comments:
morrita said...
Good article!
I've made a Japanese translation of your article, which is available at
http://www.hyuki.com/yukiwiki/wiki.cgi?ConsistentHashing .
If you have any trouble, please let me know.
Thank you for your work.
1 December 2007 at 06:06
Marcus said...
Cool! I'm as we speak creating a distributed caching and searching
system which uses JGroups for membership. The biggest problem I
faced was this exact thing. What to do on the
member-joined/left events and for the system to be able to know
at all times to which node to send what command :)
marcusherou said...
Hi. Do you have any clue of how to create an algorithm which
tracks the history of joins/leaves of members and delivers the same
node for the same key if it previously looked it up? Perhaps I'm
explaining this in bad terms, but something like an (in memory or
persistent) database in conjunction with a consistent hash.
perhaps:
public Address getAddress(key) {
    if (lookedUpMap.containsKey(key)) {
        return (Address) lookedUpMap.get(key);
    } else {
        Address a = get(key);
        lookedUpMap.put(key, a);
        return a;
    }
}
One consequence of this bug is that as nodes come and go you may
slowly lose replicas. A more serious consequence is that if two
clients notice changes to the set of nodes in a different order (very
possible in a distributed system), clients will no longer agree on
the key to node mapping. For example, suppose nodes x and y have
replicas that collide as above, and node x fails around the same
time that new (replacement?) node y comes online. In response,
suppose that ConsistentHash client A invokes add(y) ... remove(x)
and that another client B does the same but in the reverse order.
From now on, the set of keys corresponding to the clobbered
replica will map to node y at B, while A will map them somewhere
else.
Christophe said...
I have written a small test app that tells me how many nodes
should be relocated in the event of a node addition, by comparing
the same dataset over 2 hashsets.
http://pastebin.com/f459047ef
31 May 2009 at 01:40
chingju said...
great article, and the Japanese translation from morrita is cool!!!
19 September 2009 at 03:19
Solution
We also need a data structure to maintain page numbers in cache in the order of their access time.
One way to do that is to keep a timestamp field for each record, but we still need to sort them which
cannot be done in O(1) time. Alternatively, we can use a linked list to keep all records, and move the
newly visited one to the head of the list. To get O(1) time complexity for updating such a linked list,
we need a doubly linked list.
If it is already in the cache, move the node to the head of the linked list;
If it is not in the cache, insert it to the head of the linked list and update the current capacity of the
cache. If the cache is full, remove the last node of the linked list. (So, we also need a tail pointer. :)
public static class LruCacheImpl implements LruPageCache {
private int capacity = 0;
private int maxCapacity = 10;
private DListNode head = null;
private DListNode tail = null;
private HashMap<Integer, DListNode> map = new HashMap<Integer, DListNode>();
/** {@inheritDoc} */
@Override
public void setMaxCapacity(final int limit) {
if (limit < 1) {
throw new IllegalArgumentException("Max capacity must be positive.");
}
maxCapacity = limit;
}
/** {@inheritDoc} */
@Override
public int loadPage(final int page) {
    DListNode cur = map.get(page);
    if (cur != null) {
        // cache hit: unlink the node and move it to the head of the list
        if (cur.pre != null) cur.pre.next = cur.next;
        if (cur.next != null) cur.next.pre = cur.pre;
        if (tail == cur) tail = cur.pre;
        if (head == cur) head = cur.next;
        insertToHead(cur);
        print();
        return cur.val;
    }
    // cache miss: insert a new node at the head and evict the tail if full
    cur = new DListNode(page);
    insertToHead(cur);
    map.put(page, cur);
    if (capacity == maxCapacity) {
        removeTail(); // removeTail() must also remove the evicted page from the map
    } else {
        ++capacity;
    }
    print();
    return cur.val;
}
/** Add the given node to the head of the linked list. */
private void insertToHead(final DListNode cur) {
cur.next = head;
cur.pre = null;
if (head != null) head.pre = cur;
head = cur;
if (tail == null) tail = cur;
}
Note: Java provides a LinkedHashMap class, which is a hash map backed by a doubly linked list. It is
essentially what we have done here, except that LinkedHashMap does not evict entries by default (you get a
capacity limit by overriding removeEldestEntry). So, in the real world, don't reinvent the wheel!
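To illustrate, a capacity-limited LRU cache on top of LinkedHashMap only needs the access-order constructor and an overridden removeEldestEntry (a minimal sketch; the class name and sizing are arbitrary):

import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedHashMapLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxCapacity;

    public LinkedHashMapLruCache(final int maxCapacity) {
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxCapacity = maxCapacity;
    }

    @Override
    protected boolean removeEldestEntry(final Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the capacity is exceeded.
        return size() > maxCapacity;
    }
}

With access order enabled, every get and put moves the touched entry to the most recently used position, and the map silently drops the eldest entry whenever the size limit is exceeded.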
Next up in the toolbox series is an idea so good it deserves an entire article all to itself: consistent
hashing.
Let’s say you’re a hot startup and your database is starting to slow down. You decide to cache some
results so that you can render web pages more quickly. If you want your cache to use multiple servers
(scale horizontally, in the biz), you’ll need some way of picking the right server for a particular key. If
you only have 5 to 10 minutes allocated for this problem on your development schedule, you’ll end up
using what is known as the naïve solution: put your N server IPs in an array and pick one using key %
N.
I kid, I kid — I know you don’t have a development schedule. That’s OK. You’re a startup.
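In code, the naïve scheme is about as short as it sounds (a sketch; the server IPs are made up):

import java.util.Arrays;
import java.util.List;

public class NaiveSelector {

    // The naïve approach: pick a server by taking the key modulo the number of servers.
    static String pickServer(final List<String> servers, final int key) {
        return servers.get(Math.abs(key % servers.size()));
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
        System.out.println(pickServer(servers, 42)); // same server every time, until N changes
    }
}

Every key maps to a fixed slot in the array, which is exactly why changing N reshuffles almost every key.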
Anyway, this ultra simple solution has some nice characteristics and may be the right thing to do. But
your first major problem with it is that as soon as you add a server and change N, most of your cache
will become invalid. Your databases will wail and gnash their teeth as practically everything has to be
pulled out of the DB and stuck back into the cache. If you’ve got a popular site, what this really means
is that someone is going to have to wait until 3am to add servers because that is the only time you can
handle having a busted cache. Poor Asia and Europe — always getting screwed by late night server
administration.
You’ll have a second problem if your cache is read-through or you have some sort of processing
occurring alongside your cached data. What happens if one of your cache servers fails? Do you just fail
the requests that should have used that server? Do you dynamically change N? In either case, I
recommend you save the angriest tweets about your site being down. One day you'll look back and laugh.
As I said, though, that might be OK. You may be trying to crank this whole project out over the
weekend and simply not have time for a better solution. That is how I wrote the caching layer for
Audiogalaxy searches, and that turned out OK. The caching part, at least. But if I had known about it at
the time, I would have started with a simple version of consistent hashing. It isn’t that much more
complicated to implement and it gives you a lot of flexibility down the road.
The technical aspects of consistent hashing have been well explained in other places, and you’re crazy
and negligent if you use this as your only reference. But, I'll try to do my best. Consistent hashing is a
technique that addresses these problems:
Given a resource key and a list of servers, how do you find a primary, secondary, tertiary (and on down
the line) server for the resource?
If you have different size servers, how do you assign each of them an amount of work that corresponds
to their capacity?
How do you smoothly add capacity to the system without downtime? Specifically, this means solving
two problems:
o How do you avoid dumping 1/N of the total load on a new server as soon as you turn it on?
The basic trick is to map the output range of your hash function onto the edge of a circle, like a
clock face. Sure, this will make it more complicated when you try to explain it to your boss, but bear
with me:
Now imagine hashing resources into points on the circle. They could be URLs, GUIDs, integer IDs, or
any arbitrary sequence of bytes. Just run them through a good hash function (eg, SHA1) and shave off
everything but 8 bytes. Now, take those freshly minted 64-bit numbers and stick them onto the circle:
Finally, imagine your servers. Imagine that you take your first server and create a string by appending
the number 1 to its IP. Let’s call that string IP1-1. Next, imagine you have a second server that has
twice as much memory as server 1. Start with server #2’s IP, and create 2 strings from it by appending
1 for the first one and 2 for the second one. Call those strings IP2-1 and IP2-2. Finally, imagine you
have a third server that is exactly the same as your first server, and create the string IP3-1. Now, take
all those strings, hash them into 64-bit numbers, and stick them on the circle with your resources:
Can you see where this is headed? You have just solved the problem of which server to use for
resource A. You start where resource A is and head clockwise on the ring until you hit a server. If that
server is down, you go to the next one, and so on and so forth. In practice, you’ll want to use more
than 1 or 2 points for each server, but I’ll leave those details as an exercise for you, dear reader.
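Here is a rough Java sketch of that ring, hashing both resources and per-server point strings with SHA-1 and keeping the first 8 bytes of the digest (the IPs, point counts, and class name are illustrative, not from the article):

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {

    // The ring: 64-bit points mapped to server identifiers.
    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

    // Hash a string to a point on the ring: SHA-1, keeping only the first 8 bytes.
    static long point(final String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(key.getBytes("UTF-8"));
        return ByteBuffer.wrap(digest, 0, 8).getLong();
    }

    // A server with more capacity gets more points on the ring.
    public void addServer(final String ip, final int points) throws Exception {
        for (int i = 1; i <= points; i++) {
            ring.put(point(ip + "-" + i), ip);
        }
    }

    // Walk clockwise from the resource's point to the next server point, wrapping around.
    public String serverFor(final String resourceKey) throws Exception {
        SortedMap<Long, String> tail = ring.tailMap(point(resourceKey));
        return ring.get(tail.isEmpty() ? ring.firstKey() : tail.firstKey());
    }

    public static void main(String[] args) throws Exception {
        HashRing ring = new HashRing();
        ring.addServer("10.0.0.1", 1); // server #1: IP1-1
        ring.addServer("10.0.0.2", 2); // server #2 has twice the memory: IP2-1, IP2-2
        ring.addServer("10.0.0.3", 1); // server #3: IP3-1
        System.out.println(ring.serverFor("resource-A"));
    }
}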
Now, allow me to use bullet points to explain how cool this is:
Assuming you’ve used a lot more than 1 point per server, when one server goes down, every other
server will get a share of the new load. In the case above, imagine what happens when server #2 goes
down. Resource A shifts to server #1, and resource B shifts to server #3 (Note that this won’t help if all
of your servers are already at 100% capacity. Call your VC and ask for more funding).
You can tune the amount of load you send to each server based on that server’s capacity. Imagine this
spatially – more points for a server means it covers more of the ring and is more likely to get more of
the load.
You could have a process try to tune this load dynamically, but be aware that you’ll be stepping close
to problems that control theory was built to solve. Control theory is more complicated than consistent
hashing.
If you store your server list in a database (2 columns: IP address and number of points), you can bring
servers online slowly by gradually increasing the number of points they use. This is particularly
important for services that are disk bound and need time for the kernel to fill up its caches. This is one
way to deal with the datacenter variant of the Thundering Herd Problem.
Here I go again with the control theory — you could do this automatically. But adding capacity usually
happens so rarely that just having somebody sitting there watching top and running SQL updates is
probably fine. Of course, EC2 changes everything, so maybe you’ll be hitting the books after all.
If you are really clever, when everything is running smoothly you can go ahead and pay the cost of
storing items on both their primary and secondary cache servers. That way, when one server goes
down, its secondary already has a warm copy of the data. Whatever you do,
consider what happens when machines fail. If the answer is “we crush the databases,” congratulations:
you will get to observe a cascading failure. I love this stuff, so hearing about cascading failures makes
me smile.
Finally, you may not know this, but you use consistent hashing every time you put something in your
cart at Amazon.com. Their massively scalable data store, Dynamo, uses this technique. Or if you use
Last.fm, you’ve used a great combination: consistent hashing + memcached. They were kind enough
to release their changes, so if you are using memcached, you can just use their code without dealing
with these messy details. But keep in mind that there are more applications to this idea than just
simple caching. Consistent hashing is a powerful idea for anyone building services that have to scale.
Would sticky sessions enabled on a load balancer help with the caching issue “If you want your cache
to use multiple servers (scale horizontally, in the biz), you’ll need some way of picking the right server
for a particular key”?
localhost
March 18, 2008 at 12:45 am
Al
March 23, 2008 at 3:07 am
Peter,
Sticking a users session to a server, web, application or other, isn’t going to help in determining what
particular server within your cache cluster has the key you want.
Al.
How To Split Randomly But Unevenly - PHP Code For Load UNBalancing (Utopia
Mechanicus)
Pingback on Apr 3rd, 2008 at 1:52 am
David
April 7, 2008 at 11:24 am
Thanks for sharing important logic like this that most people don’t think of until they are in a big mess.
Mike Zintel
April 15, 2008 at 9:34 pm
Paul Annesley
April 21, 2008 at 5:20 pm
Your clear explanation and illustrations inspired me to write an open source implementation for PHP,
as I couldn’t see anything decent around that fit the bill. I’ve put it on Google Code
Oh, there’s a java version in there too along with the libketama C library that is used for the PHP
extension
-= Linkage 2007.02.18 =-
Pingback on Jan 26th, 2009 at 7:37 am
Ross
March 26, 2010 at 3:44 am
Hej Tom,
I got half excited about your article… hits most of the squares… However your circle shows the servers
nicely equidistant on the 360 but if we leave their placement to the hash function then they could
actually all appear in a very acute part of the circle so that the services or resources could mostly be