A Memcached Implementation in JGroups
Memcached
Memcached is a widely used cache which can be distributed across a number of hosts. It is a
hashmap storing key/value pairs. Its main methods are set(K,V), which adds a key/value pair, get(K),
which returns the value for a previously inserted key, and delete(K), which removes a key/value pair.
Memcached is started on a given host and listens on a given port (11211 being the default). The
daemon is written in C, but clients can be written in any language and talk to the daemon via the
memcached protocol.
Typically, multiple memcached daemons are started, on different hosts. The clients are passed a list of
memcached addresses (IP address and port) and pick one daemon for a given key. This is done via
consistent hashing, which always maps the same key K to the same memcached server S. When a
server crashes or a new server is added, consistent hashing ensures that the ensuing rehashing is
minimal: most keys still map to the same servers, and only the keys that hashed to a removed server
are rehashed to a different server.
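As an illustration of the principle (not memcached's actual code), a minimal consistent hash could place the servers on a ring of hash values and pick, for a given key, the first server at or after the key's position; removing a server then only remaps the keys that pointed at it:

import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent hashing sketch: servers are placed on a ring of hash values;
// a key is handled by the first server whose position on the ring is >= the key's hash.
public class ConsistentHash {
    private final SortedMap<Integer,String> ring=new TreeMap<Integer,String>();

    public void addServer(String server)    {ring.put(hash(server), server);}
    public void removeServer(String server) {ring.remove(hash(server));}

    public String getServer(String key) {        // pick the server responsible for key
        if(ring.isEmpty()) return null;
        SortedMap<Integer,String> tail=ring.tailMap(hash(key));
        Integer pos=tail.isEmpty()? ring.firstKey() : tail.firstKey(); // wrap around the ring
        return ring.get(pos);
    }

    private static int hash(String s) {return s.hashCode() & 0x7fffffff;} // simplistic; real clients use better hashes
}

Real implementations additionally place each server at many positions on the ring (virtual nodes) to spread the load more evenly, but the principle is the same.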
Memcached does not provide any redundancy (e.g. via replication of its hashmap entries); when a
server S is stopped or crashes, all key/value pairs hosted by S are lost.
The main goal of memcached is to provide a large distributed cache sitting in front of a database (or
file system). Applications typically ask the cache for an element and return it when found, or else ask
the DB for it. In the latter case, the element is added to the cache before returning it to the client and
will now be available from the cache on the next cache access.
This speeds up applications with good cache locality (fetching web pages comes to mind), because
even a round trip in a LAN is typically faster than a round trip to the DB server, plus the disk access
needed to serve a SELECT statement.
In addition, clients now have a huge aggregated cache: if we start 10 memcached daemons with 2GB
of memory each, we get a 20GB (virtual) cache. This is more memory than most individual hosts have
today (2008).
Figure 1 shows a typical use of memcached. The example shows an instance of Apache which serves
HTTP requests and a Python client running in the mod_python Apache module. There are 3 instances
of memcached started, on hosts A, B and C. The Python client is configured with a list of [A,B,C].
When a request hits the Python code that requires a DB lookup, the Python code first checks the cache
by hashing the key to a server. Let's assume K is hashed to C, so now the client sends a GET(K) request
to C via the memcached protocol and waits for the result. If the result is not null, it is returned to the
caller as an HTTP response. Otherwise, the value for K is retrieved from the DB, inserted into the cache
(on host C) and then returned to the client. This ensures that the next lookup for K (by any client) will
find K,V in the cache (on host C) and will not have to perform a DB lookup.
A modification to the DB is typically also followed by a cache update: if K/V is modified, we execute
a SET(K,V,T) against the cache after the DB update.
Each SET request carries an expiry timestamp T, which lets the client define when the entry should be
evicted from the cache. The timestamp is given in seconds: 0 means don't expire (although such entries
can still be evicted when the cache needs to make room for new ones), and -1 means don't cache.
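Putting the read and write paths together, the typical client-side pattern looks roughly like the following sketch; the Cache and Database interfaces below are hypothetical stand-ins for a memcached client and a DB access layer:

// Hypothetical interfaces standing in for a memcached client and a DB access layer
interface Cache    {byte[] get(String key); void set(String key, byte[] val, int expiry_secs);}
interface Database {byte[] select(String key); void update(String key, byte[] val);}

public class CacheAside {
    static final int EXPIRY=3600; // expire cached entries after one hour

    // Read path: ask the cache first, fall back to the DB and populate the cache
    static byte[] read(Cache cache, Database db, String key) {
        byte[] val=cache.get(key);
        if(val != null)
            return val;                   // cache hit
        val=db.select(key);               // cache miss: go to the DB
        if(val != null)
            cache.set(key, val, EXPIRY);  // the next lookup will be served from the cache
        return val;
    }

    // Write path: update the DB, then refresh the cache with SET(K,V,T)
    static void write(Cache cache, Database db, String key, byte[] val) {
        db.update(key, val);
        cache.set(key, val, EXPIRY);
    }
}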
[Figure 1: Apache httpd with a Python client running in mod_python, and three memcached instances on hosts A, B and C]
JGroups implementation
PartitionedHashMap is an implementation of memcached on top of JGroups, written completely in
Java. It has a couple of advantages over memcached:
● Java clients and PartitionedHashMap can run in the same address space and therefore don't need
to use the memcached protocol to communicate. That protocol is text based (a binary protocol has
been discussed, but hasn't materialized yet as of 2008) and slow due to serialization. Running in the
same address space allows servlets to access the cache directly, without serialization overhead.
● All PartitionedHashMap processes know about each other, and can therefore make intelligent
decisions as to what to do when a cluster membership change occurs. For example, a server that is
about to be stopped can migrate all of the keys it manages to the next server, whereas with
memcached, the entries hosted by a server S are lost when S goes down. Of course, this doesn't work
when S crashes.
● Similar to the above point, when a cluster membership change occurs (e.g. a new server S is
started), all servers check whether an entry hosted by them should actually be hosted by S, and
move all such entries to S. This has the advantage that entries don't have to be re-read from the DB
(for example) and re-inserted into the cache (as in memcached's case): the cache rebalances itself
automatically (a rough sketch of this follows after this list).
● PartitionedHashMap has a level 1 cache (L1 cache). This allows for caching of data near to
where it is really needed. For example, if we have servers A, B, C, D and E and a client adds a
(to be heavily accessed) news article to C, then memcached would always redirect every single
request for the article to C. So, a client accessing D would always trigger a GET request from D
to C and then return the article. JGroups caches the article in D's L1 cache on the first access, so
all other clients accessing the article from D would get the cached article, and we can avoid a
round trip to C. Note that each entry has an expiration time, which will cause the entry to be
removed from the L1 cache on expiry, and the next access would have to fetch it again from C
and place it in D's L1 cache. The expiration time is defined by the submitter of the article.
● Since the RPCs for GETs, SETs and REMOVEs use JGroups as transport, the type of transport
and the quality of service can be controlled and customized through the underlying XML file
defining the transport. For example, we could add compression, or decide to encrypt all RPC
traffic. It also allows for use of either UDP (IP multicasting and/or UDP datagrams) or TCP.
● The connector (org.jgroups.blocks.MemcachedConnector) which is responsible for parsing the
memcached protocol and invoking requests on PartitionedHashMap, PartitionedHashMap
(org.jgroups.blocks.PartitionedHashMap) which represents the memcached implementation, the
server (org.jgroups.demos.MemcachedServer) and the L1 and L2 caches
(org.jgroups.blocks.Cache) can be assembled or replaced at will. Therefore it is simple to
customize the JGroups memcached implementation; for example to use a different
MemcachedConnector which parses a binary protocol (requiring matching client code of
course).
● All management information and operations are exposed via JMX.
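The rebalancing mentioned above can be sketched as follows. This is not the actual PartitionedHashMap code; the helpers hashToMember() and moveTo() are invented for illustration, assuming consistent hashing over the members of the new view and an RPC that inserts the entry on the new owner:

import java.util.Iterator;
import java.util.Map;
import org.jgroups.Address;
import org.jgroups.View;

// Rough sketch: on a view change, every member walks its local entries and moves
// the ones that now hash to a different member.
public class RebalanceSketch<K,V> {
    Map<K,V> local_entries;   // entries currently hosted by this member
    Address  local_addr;

    public void viewAccepted(View new_view) {  // invoked on a membership change (cf. MembershipListener.viewAccepted())
        for(Iterator<Map.Entry<K,V>> it=local_entries.entrySet().iterator(); it.hasNext();) {
            Map.Entry<K,V> entry=it.next();
            Address owner=hashToMember(entry.getKey(), new_view); // consistent hash over the new membership
            if(!owner.equals(local_addr)) {
                moveTo(owner, entry.getKey(), entry.getValue());  // e.g. an RPC which inserts the entry remotely
                it.remove();
            }
        }
    }

    Address hashToMember(K key, View view) {return null;} // placeholder: consistent hashing over view.getMembers()
    void moveTo(Address dest, K key, V val) {}            // placeholder: remote insert on dest
}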
[Figure 2: Apache httpd forwards HTTP requests via mod_jk to JBoss/Tomcat instances (e.g. host B), each running a servlet and a PartitionedHashMap instance]
The local cache instance will hash each key K to either host A, B or C. So, for example, if the servlet
on host A does a GET(K), and K is hashed to C, then the PartitionedHashMap on A will forward the
GET(K) to C. C returns the value associated with K to A. If A has an L1 cache enabled, then A will
cache K,V (depending on K's expiration time). The next time a client on A accesses K, V will be
returned from A's L1 cache directly.
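The read path on host A therefore looks roughly like the sketch below. Again, this is a simplified illustration rather than the actual PartitionedHashMap code, and the helper methods are invented:

import org.jgroups.Address;

// Simplified sketch of a get(): serve locally if we own the key, otherwise consult
// the L1 cache and only then do a blocking RPC to the owning member.
public class GetPathSketch<K,V> {
    Address local_addr;

    public V get(K key) {
        Address owner=hashToMember(key);   // consistent hash of the key to a cluster member
        if(owner.equals(local_addr))
            return localGet(key);          // we own the key: return it from the main (L2) cache
        V val=l1CacheGet(key);             // near cache: avoid the RPC if the value was seen recently
        if(val != null)
            return val;
        val=remoteGet(owner, key);         // blocking RPC to the owner
        if(val != null)
            l1CachePut(key, val);          // cache it locally, subject to the entry's expiration time
        return val;
    }

    // placeholders, invented for illustration
    Address hashToMember(K key)      {return null;}
    V localGet(K key)                {return null;}
    V l1CacheGet(K key)              {return null;}
    void l1CachePut(K key, V val)    {}
    V remoteGet(Address dest, K key) {return null;}
}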
Optimization
Note that because we run Java clients rather than Python clients, we could use a hardware load
balancer which directs HTTP requests for static content to Apache, and requests for dynamic content
directly to one of the 3 JBoss servers, as shown in Fig. 3.
Note that we don't need sticky request balancing here (unless the servlet also accesses HTTP sessions,
which are replicated), because the cache is globally available. Of course, sticky request balancing
might help increase cache locality: if client C is always directed to host B, then B's L1 cache will
already contain the entries most frequently accessed by C.
Fig. 3 shows that the hardware load balancer lets Apache handle static web pages, but directs requests
for dynamic content to one of the 3 servers. Over time, each of the 3 servers will build up a bounded
L1 cache of its most frequently accessed entries.
[Figure 3: a hardware load balancer sends requests for static content to Apache httpd and requests for dynamic content directly to the JBoss/Tomcat servers, each running a servlet and a PartitionedHashMap instance]
MemcachedServer
The main class to start the JGroups memcached implementation is
org.jgroups.demos.MemcachedServer. It creates
1. an L1 cache (if configured)
2. an L2 cache (the default hashmap storing all entries)
3. a MemcachedConnector
The connector parses the memcached protocol and is optional, for example when running inside of
JBoss. In that case, you'd only create a PartitionedHashMap instance and possibly set an L1 cache in it.
The command line options for MemcachedServer are listed below ('-help' dumps all options):
-bind_addr            The network interface on which to listen, e.g. 192.168.1.5. Used by MemcachedConnector.
-port / -p            The port to listen on (default: 11211). Used by MemcachedConnector.
-props                The JGroups properties, which define the cluster. Used by PartitionedHashMap.
-min_threads          The number of core threads in the thread pool. Used by MemcachedConnector.
-max_threads          The maximum number of threads in the thread pool. Used by MemcachedConnector.
-rpc_timeout          The maximum time in milliseconds to wait for a blocking RPC (e.g. a GET request) to return. Used by PartitionedHashMap.
-caching_time         The default expiry time in milliseconds. 0 means never evict. Can be overridden in a SET. Used by PartitionedHashMap.
-migrate_data         Whether or not to migrate entries to a new host after a cluster membership change (default: true). Used by PartitionedHashMap.
-use_l1_cache         Whether or not to use an L1 cache.
-l1_max_entries       The maximum number of entries in the L1 cache (if enabled). Default: 10000.
-l1_reaping_interval  The interval (in ms) at which expired entries are looked for and evicted from the L1 cache. Default: -1 (disabled).
-l2_max_entries       The maximum number of entries in the default (L2) cache. Default: -1 (unbounded cache).
-l2_reaping_interval  The interval (in ms) at which expired entries are looked for and evicted from the L2 cache. Default: 30000.
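For example, a server could be started as follows; the jar name and the JGroups configuration file (udp.xml) are placeholders for whatever your environment uses:

java -cp jgroups.jar org.jgroups.demos.MemcachedServer -props udp.xml -p 11211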
API: PartitionedHashMap
The API is very simple:
● public void put(K key, V val): puts a key / value pair into the cache with the default caching
time
● public void put(K key, V val, long caching_time): same as above, but here we can explicitly
define the caching time. 0 means cache forever, -1 means don't cache and any positive value is
the number of milliseconds to cache the entry
● public V get(K key): gets a value for key K
● public void remove(K key): removes a key/value pair from the cache (from L2, and from L1 if enabled)
Code sample
PartitionedHashMap<String,String> map=new PartitionedHashMap<String,String>(props, "demo-cluster");
Cache<String,String> l1_cache=new Cache<String,String>();  // L1 cache, bounded at 5 entries, no reaping
l1_cache.setMaxNumberOfEntries(5);
map.setL1Cache(l1_cache);
map.getL2Cache().enableReaping(10000);                     // L2 cache: look for expired entries every 10 seconds
map.setMigrateData(migrate_data);
map.start();                                               // starts the JGroups channel and the reaper timer
map.put("name", "Bela");                                   // insert an entry
System.out.println("name=" + map.get("name"));             // fetch it again
map.remove("name");                                        // and remove it
System.out.println("cache:\n" + map);                      // print some information about the cache
l1_cache.stop();
map.stop();
The example first creates an instance of PartitionedHashMap, which accepts String as both keys and
values. Then we configure an L1 cache which is bounded (5 entries max) and doesn't have reaping
enabled. This means that when we insert an entry and the size of the L1 cache exceeds 5, the oldest
entry will be removed to bring the size back to 5.
The next step configures the L2 cache (the main cache) to look for entries to evict every 10 seconds.
Then we enable data migration on cluster view changes and start the cache. The start() method is
important, because it starts the JGroups channel plus the reaper timer.
Then we insert an entry, fetch it again and finally remove it. After printing out some information about
the cache, we stop it.
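The following output shows two MemcachedServer instances being started on the same host (192.168.1.5); the first instance listens on the default memcached port 11211: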
-------------------------------------------------------
GMS: address is 192.168.1.5:33377
-------------------------------------------------------
view = [192.168.1.5:33377|0] [192.168.1.5:33377]
0 [TRACE] PartitionedHashMap: - node mappings:
138: 192.168.1.5:33377
-------------------------------------------------------
GMS: address is 192.168.1.5:38613
-------------------------------------------------------
view = [192.168.1.5:33377|1] [192.168.1.5:33377, 192.168.1.5:38613]
0 [TRACE] PartitionedHashMap: - node mappings:
138: 192.168.1.5:33377
902: 192.168.1.5:38613
The second instance is started at port 11212, and its JGroups address is 192.168.1.5:38613. We can see
that the 2 instances formed a cluster; both instances should have a view of
[192.168.1.5:33377|1] [192.168.1.5:33377, 192.168.1.5:38613].
A memcached client would now have to be configured with both servers: 192.168.1.5:11211 and
192.168.1.5:11212.
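We can now talk to the cache via the memcached protocol, for example by opening a telnet session to the first instance (telnet 192.168.1.5 11211) and adding a key "name" with the 4-byte value "Bela" (the flags and expiry fields below are just examples):
set name 0 0 4
Bela
STORED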
This means we added a key “name” with a 4-byte value of “Bela”. Let's add another key/value:
set id 0 0 6
322649
STORED
To get the value associated with “name” and “id”, execute the following:
get id name
VALUE id 0 6
322649
VALUE name 0 4
Bela
END
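The keys "last_name" and "hobbies", which appear below, would have been added in the same way, for example:
set last_name 0 0 3
ban
STORED
set hobbies 0 0 4
none
STORED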
Now the first instance is stopped. We can see in the other shell that its keys were added to the second instance:
867581 [TRACE] PartitionedHashMap: - _put(last_name, [B@3a747fa2, 0)
867685 [TRACE] PartitionedHashMap: - _put(name, [B@2092dcdb, 0)
867751 [TRACE] PartitionedHashMap: - _put(id, [B@29d381d2, 0)
view = [192.168.1.5:33377|2] [192.168.1.5:38613]
867871 [TRACE] PartitionedHashMap: - node mappings:
902: 192.168.1.5:38613
We also see that the cluster view now only shows one member: instance two.
A GET on the telnet session connected to instance two reveals that the keys are indeed present:
get name id hobbies last_name
VALUE name 0 4
Bela
VALUE id 0 6
322649
VALUE hobbies 0 4
none
VALUE last_name 0 3
ban
END
This shows that key/value pairs for “name”, “last_name” and “id” were hosted by instance one and
migrated to instance two on stopping instance one.
Tips and Tricks
JGroups configuration
For optimal performance, message bundling should be disabled for unicast messages (which are used
for RPCs, e.g. remote GET requests). This is done by setting enable_unicast_bundling to false in the
transport, e.g. UDP or TCP. The reason for this is described in more detail at
http://wiki.jboss.org/wiki/ProblemAndClusterRPCs.
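For example, the transport element in the XML configuration might contain the following attribute (all other attributes of the transport omitted):
<UDP
    ...
    enable_unicast_bundling="false"
    ... />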