
Exploiting Nil-Externality for Fast Replicated Storage

Aishwarya Ganesan, VMware Research
Ramnatthan Alagappan, VMware Research
Andrea C. Arpaci-Dusseau, University of Wisconsin – Madison
Remzi H. Arpaci-Dusseau, University of Wisconsin – Madison

SOSP '21, October 26–29, 2021, Virtual Event, Germany. https://doi.org/10.1145/3477132.3483543

Abstract
Do some storage interfaces enable higher performance than others? Can one identify and exploit such interfaces to realize high performance in storage systems? This paper answers these questions in the affirmative by identifying nil-externality, a property of storage interfaces. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world. As a result, a storage system can apply nilext operations lazily, improving performance.
In this paper, we take advantage of nilext interfaces to build high-performance replicated storage. We implement SKYROS, a nilext-aware replication protocol that offers high performance by deferring ordering and executing operations until their effects are externalized. We show that exploiting nil-externality offers significant benefit: for many workloads, SKYROS provides higher performance than standard consensus-based replication. For example, SKYROS offers 3× lower latency while providing the same high throughput offered by throughput-optimized Paxos.

1 Introduction
Defining the right interfaces is perhaps the most important aspect of system design [46], as well-designed interfaces often lead to desirable properties. For example, idempotent interfaces make failure recovery simpler [13, 70]; commutative interfaces enable scalable software implementations [14].
In a similar spirit, this paper asks: Do some types of interfaces enable higher performance than others in storage systems? Our exercise in answering this question has led us to identify an important storage-interface property which we call nil-externality. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world (apart from the acknowledgment itself). As a result, a storage system can apply a nilext operation in a deferred manner after acknowledgment, improving performance.
In this paper, we exploit nil-externality to design high-performance replicated storage that offers strong consistency (i.e., linearizability [36]). A standard approach today to building such a system is to use a consensus protocol like Paxos [44], Viewstamped Replication (VR) [52], or Raft [62]. For example, Facebook's ZippyDB uses Paxos to replicate RocksDB [73]; Harp builds a replicated file system using VR [53]; other examples exist as well [7, 17, 18, 22].
A storage system built using this standard approach performs several actions before it returns a response to a request. Roughly, the system makes the request durable (if it is an update), orders the request with respect to other requests, and finally executes the request. Usually, a leader replica orchestrates these actions [52, 62]. Upon receiving requests, the leader decides the order and then replicates the requests (in order) to a set of followers; once enough followers respond, the leader applies the requests and returns responses. Unfortunately, this process is expensive: updates incur two round trips (RTTs) to complete.
The system can defer some or all of these actions to improve performance. Deferring durability, however, is unsafe: if an acknowledged write is lost, the system would violate linearizability [31, 48]. Fortunately, durability can be ensured without coordination: clients can directly store updates in a single RTT on the replicas [64, 80]. However, ordering (and subsequent execution) requires coordination among the replicas and thus is expensive. Can a system hide this cost by deferring ordering and execution?
At first glance, it may seem like all operations must be synchronously ordered and executed before returning a response. However, we observe that if the operation is nilext, then it can be ordered and executed lazily because nilext operations do not externalize state or effects immediately.
Nilext interfaces have performance advantages, but are they practical? Perhaps surprisingly, we find that nilext interfaces are not just practical but prevalent in storage systems (§2). As a simple example, consider the put interface in the key-value API.

Put is nilext because it does not externalize the state of the key-value store: it does not return an execution result or an execution error (for instance, by checking if the key already exists). In fact, popular key-value stores such as RocksDB [29], LevelDB [33], and others built atop write-optimized structures (like LSMs [63] and Bε-trees [8]) transform all updates into nilext writes by design; querying a write-optimized structure before every update can be very expensive [6]. Thus, in these systems, even updates that read prior state and modify data are nilext (in addition to blind writes that simply overwrite data).
Nilext-aware replication is a new approach to replication that takes advantage of nil-externality of storage interfaces (§3). The key idea behind this approach is to defer ordering and executing operations until their effects are externalized. Because nilext updates do not externalize state, they are made durable immediately, but expensive ordering and execution are deferred, improving performance. The effects of nilext operations, however, can be externalized by later non-nilext operations (e.g., a read to a piece of state modified by a nilext update). Thus, nilext operations must still be applied in the same (real-time) order across replicas for consistency. This required ordering is established in the background and enforced before the modified state is externalized. While nilext interfaces lead to high performance, it is, of course, impractical to make all interfaces nilext: applications do need state-externalizing updates (e.g., increment and return the latest value, or return an error if a key is not present). Such non-nilext updates are immediately ordered and executed for correctness.
Nilext-aware replication delivers high performance in practice. First, while applications do require non-nilext updates, such updates are less frequent than nilext updates. For instance, nilext set is the most popular kind of update in Memcached [1]. Similarly, put, delete, and merge (read-modify-writes that do not return results), which are all nilext, are the dominant type of updates in ZippyDB [11]. We find similar evidence in production traces from IBM [24] and Twitter [79]. Further, while reads do externalize state, not every read triggers synchronous ordering. In many workloads, updates to an object can be ordered and executed in the background before applications read the object. Our analyses of production traces from IBM COS [24] reveal that this is indeed the case (§3.3).
Nilext-aware replication draws inspiration from the general idea of deferring work until needed, similar to lazy evaluation in functional languages [37], externally synchronous file I/O [60], and previous work in databases [30, 68]. Here, we apply this general idea to hide the cost of ordering and execution in replicated storage. Prior approaches like speculative execution [41, 42, 67] reduce ordering cost by eagerly executing and then verifying that the order matches before notifying end applications. Nilext-aware replication, in contrast, realizes that some operations can be lazily ordered and executed after notifying end applications of completion.
We build SKYROS, a new protocol that adapts state machine replication [71] to take advantage of nilext interfaces (§4). The main challenge in our design is to ensure linearizability (especially during view changes) while maintaining high performance. To this end, SKYROS applies many techniques. SKYROS first uses supermajority quorums and a new durability-log design to complete nilext writes in one RTT. Second, SKYROS implements an ordering-and-execution check to serve reads in one RTT. Finally, SKYROS employs a DAG-based order-resolution technique to reconstruct the linearizable order during view changes.
While SKYROS defers ordering, Generalized Paxos [45], Curp [64], and other protocols [58, 65] realize that ordering is in fact not needed when operations commute. However, these protocols incur overhead when writes conflict and when interface operations do not commute. For instance, when multiple writers append records to a file (a popular workload in GFS [32]), these protocols incur high overhead (2 or 3 RTTs in Curp). In contrast, SKYROS can defer ordering such operations because they are nilext. More importantly, nil-externality is compatible with commutativity: a nilext-aware protocol can also exploit commutativity to quickly commit non-nilext updates. We build SKYROS-COMM, a variant of SKYROS, to demonstrate this compatibility.
Our experiments (§5) show that SKYROS offers 3× higher throughput than Paxos (without batching) for a nilext-only workload. While batching improves Paxos' throughput, at peak throughput, SKYROS offers 3.1× lower latency. We run extensive microbenchmarks, varying request ratios, distributions, and read-latest fractions. SKYROS outperforms Paxos (with batching) in most cases; even when pushed to extremes (e.g., all non-nilext writes), SKYROS performs as well as Paxos. Under write-heavy YCSB workloads, SKYROS is 1.4× to 2.3× faster. For read-heavy workloads, while throughput gains are marginal, SKYROS reduces p99 latency by 70%. We also use SKYROS to replicate RocksDB with high performance. Finally, we compare SKYROS to Curp [64], a recent commutative protocol. Curp performs well (like SKYROS) when operations commute. However, when operations do not commute but are nilext, SKYROS offers advantages: SKYROS provides 2× better throughput for file record appends and 2.7× lower p99 latency in a key-value store. SKYROS-COMM combines the best of both worlds: it quickly completes nilext operations and exploits commutativity to speed up non-nilext operations.
This paper makes four contributions.
• First, we identify nil-externality, a property of storage interfaces, and show its prevalence.
• Second, we show how one can exploit this property to improve the performance of strongly consistent storage systems.
• Third, we present the design and implementation of SKYROS, a nilext-aware replication protocol.
• Finally, we demonstrate the performance benefits of SKYROS through rigorous experiments.

2 Nil-Externalizing Interfaces
We first define nil-externality and describe its attributes. We next analyze which interfaces are nilext in three example storage systems; then, we discuss opportunities to improve performance by exploiting nilext interfaces in general.

2.1 Nil-externality
We define an interface to be nil-externalizing if it does not externalize storage-system state: it does not return an execution result or an execution error, although it might return an acknowledgment. A nilext interface can modify state in any way (blindly set, or read and modify). The state modified by a nilext operation can be externalized at a later point by another non-nilext operation (e.g., a read). Note that although nilext operations do not return an execution error, they may return a validation error. Validation errors (e.g., a malformed request) do not externalize state and can be detected without executing the operation. Thus, an operation that returns only validation errors (but not execution errors) is nilext.
Determining whether or not an operation is nilext is simple in most cases. Nil-externality is an interface-level property: it suffices to look at the interface (specifically, the return value and the possible execution errors) to say if an operation is nilext. Nil-externality is a static property: it is independent of the system state or the arguments of an operation; one can therefore determine if an operation is nilext without having to reason about all possible system states and arguments.
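Because the property is static and interface-level, the classification can be written down once as a table over operation names, without consulting system state or arguments. The following minimal Go sketch (ours, for illustration; the operation set and the type names are assumptions, not code from any of the systems studied here) encodes such a table.

package main

import "fmt"

// OpKind is the interface-level classification discussed above.
type OpKind string

const (
	Nilext    OpKind = "nilext"     // modifies state; returns at most validation errors
	NonNilext OpKind = "non-nilext" // returns an execution result or execution error
	ReadOnly  OpKind = "read"       // externalizes state
)

// classify is a static table: the answer depends only on the operation's
// interface, never on system state or arguments.
var classify = map[string]OpKind{
	"put":    Nilext,    // e.g., RocksDB put
	"delete": Nilext,    // inserts a tombstone; no "not found" error
	"merge":  Nilext,    // read-modify-write recorded as an upsert
	"add":    NonNilext, // may return an execution error (key exists)
	"incr":   NonNilext, // returns the new value
	"get":    ReadOnly,
}

func main() {
	for _, op := range []string{"put", "merge", "incr", "get"} {
		fmt.Printf("%-6s -> %s\n", op, classify[op])
	}
}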
2.2 Nil-externality in Storage Systems
We now analyze which interfaces are nilext in three storage systems that expose a key-value API (see Table 1). We pick these systems as candidates given their widespread use [11, 27, 55, 61]; exploiting nilext interfaces in these systems to improve performance can benefit many deployments.

System      Nilext updates              Non-nilext updates                                                        Reads
RocksDB     put, write, delete, merge   —                                                                         get, multiget
LevelDB     put, write, delete          —                                                                         get, multiget
Memcached   set                         add^e, delete^e, cas^r, replace^e, append^e, decr^r, incr^r, prepend^e    get, gets

Table 1. Nil-externality in Storage Systems. The table shows which operations are nilext in popular key-value systems. I^e denotes that update interface I is non-nilext because it returns an execution error (e.g., key not found); I^r denotes a non-nilext update that returns an execution result.

RocksDB and LevelDB are LSM-based [63] key-value stores. Put in these systems is a nilext interface: it does not return an execution result or an error by checking record-existence. Similarly, write (multi-put) is also nilext. Delete is nilext because it does not return an error if the key is not present; it simply inserts a tombstone for the key. Surprisingly, even read-modify-writes (RMW) are nilext. RocksDB supports RMW via the merge operator [28], which is implemented as an upsert [6]. An upsert encodes a modification by specifying a key k and a function F that transforms the value of k. In RocksDB and other stores [15, 33] built upon write-optimized structures (LSMs and Bε-trees), reading the value of a key before updating it is expensive [6, 11, 28]. Thus, an upsert is not immediately applied, but the function and the key are simply recorded. Since an upsert is not applied immediately, it does not return an execution result or an execution error and thus merge is nilext. In fact, all modifications in write-optimized stores are a form of upserts that avoid querying before updates [6], and thus are all nilext; for instance, the tombstone inserted upon a delete is an upsert. Finally, get externalizes system state and so is not nilext.
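To make the upsert argument concrete, here is a small Go sketch (ours, with assumed type and method names; it is not RocksDB's actual merge API) of a write-optimized store that records a merge as a (key, function) pair without reading the current value; only a later read folds the pending upserts into the base value, which is why merge can remain nilext.

package main

import "fmt"

// An upsert records a transformation of a key's value without reading it.
type upsert func(old string) string

type store struct {
	base    map[string]string   // applied state
	pending map[string][]upsert // recorded but not yet applied, per key
}

func newStore() *store {
	return &store{base: map[string]string{}, pending: map[string][]upsert{}}
}

// Merge is nilext: it records the function and returns no result or error.
func (s *store) Merge(key string, f upsert) {
	s.pending[key] = append(s.pending[key], f)
}

// Get externalizes state: it must fold pending upserts into the base value.
func (s *store) Get(key string) string {
	v := s.base[key]
	for _, f := range s.pending[key] {
		v = f(v)
	}
	return v
}

func main() {
	s := newStore()
	s.Merge("k", func(old string) string { return old + "a" }) // no read needed
	s.Merge("k", func(old string) string { return old + "b" })
	fmt.Println(s.Get("k")) // "ab": effects externalized only by the read
}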
In Memcached, set is nilext because it does not return an execution result or an error; all other update interfaces are non-nilext. However, as we soon show (§3.3), these non-nilext updates are used only rarely compared to nilext set.
Nilext updates can be completed faster than non-nilext ones because their ordering and execution can be deferred. Thus, operations such as put and set in the above systems can be completed quickly, improving performance. What such opportunities exist across storage systems in general? A typical storage system supports three kinds of operations: reads, writes, and RMWs [10, 76]. While reads are non-nilext, writes and RMWs can be further classified based on whether or not they externalize state. Thus, some writes are nilext (e.g., RocksDB put), while others are not (e.g., Memcached add); similarly, some RMWs are nilext (e.g., RocksDB merge), while some are not (e.g., Memcached incr). A system can lazily apply all such nilext updates to improve performance.
Note that while nilext operations do not return errors as part of their contract, a system that lazily applies nilext writes may encounter errors (e.g., due to insufficient disk space or a bad block) at a later point. A storage system that eagerly applies updates can detect such errors early on. Fortunately, this difference is not an obstacle to realizing the benefits of nilext interfaces in practice as we discuss later (§4.8).
Given the benefits of nilext interfaces, it is worthwhile to make small changes to a non-nilext interface's semantics to make it nilext when possible. For instance, a Btree-based store may return an error upon an update to a nonexistent key; changing the semantics to not return such an error can enable a system to replicate updates quickly. Such semantic changes have been practical and useful in the past: MySQL-TokuDB supports SQL updates that do not return the number of records affected to exploit TokuDB's fast upserts [66].

3 Nilext-aware Replication
We now describe how a replicated storage system can exploit nil-externality to improve performance. To do so, we first give background on consensus, a standard substrate upon which strongly consistent storage is built. We then describe the nilext-aware replication approach and show that its high-performance cases are common in practice. We finally discuss how this new approach compares to existing approaches.

Figure 1. Request Processing in Consensus. The figure shows how writes and reads are processed in systems built atop consensus protocols (writes take 2 RTTs; reads take 1 RTT).

Figure 2. Nilext-aware Replication. The figure shows how a nilext-aware replication protocol handles different operations: nilext writes and reads of already-finalized data complete in 1 RTT (ordering and execution are finalized asynchronously in the background), while reads of not-yet-finalized data and non-nilext writes take 2 RTTs (finalized synchronously).

3.1 Consensus-based Replication Background
Consensus protocols (e.g., Paxos, VR) ensure that replicas execute operations in the same order. Clients submit operations to the leader which then ensures that replicas agree on a consistent ordering of operations before executing them. Figure 1 shows how requests are processed in the failure-free case. Upon an update, the leader assigns an index, adds the request to its log, and sends a prepare to the followers. The followers add the request to their logs and respond with a prepare-ok. Once the leader receives prepare-ok from enough followers, it applies the update and returns the result to the client. Reads are usually served by the leader locally; the leader is guaranteed to have seen all updates and so can serve the latest data, preserving linearizability. Stale reads on a deposed leader can be prevented using leases [52].
Latency is determined by the message delays in the protocol: updates take two RTTs and reads one RTT. Throughput is determined by the number of messages processed by the leader [21]. Practical systems [3] batch requests to reduce the load on the leader. While batching improves throughput, it increases latency, a critical concern for applications [67, 69].
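The two-RTT cost of this path can be seen in a minimal, failure-free Go model (ours; messaging, batching, and view changes are omitted, and all names are illustrative): an update is logged, replicated to a quorum of followers, and only then executed and acknowledged.

package main

import "fmt"

// A simplified, failure-free model of consensus-based replication (§3.1):
// every update is ordered and replicated before it is executed and acknowledged.
type entry struct{ op string }

type replica struct{ log []entry }

type leader struct {
	replica
	followers []*replica
	f         int // number of tolerated failures
	state     map[string]string
}

// Update costs two RTTs: client -> leader -> followers -> leader -> client.
func (l *leader) Update(key, val string) string {
	e := entry{op: "put " + key + "=" + val}
	l.log = append(l.log, e) // the leader assigns the next index

	acks := 0
	for _, r := range l.followers { // send prepare (in reality, to all followers in parallel)
		r.log = append(r.log, e) // the follower logs the entry and replies prepare-ok
		acks++
		if acks >= l.f { // enough followers have responded
			break
		}
	}
	l.state[key] = val // execute only after the quorum responds
	return "ok"
}

// Reads are served locally by the leader, which has seen all updates.
func (l *leader) Read(key string) string { return l.state[key] }

func main() {
	l := &leader{followers: []*replica{{}, {}, {}, {}}, f: 2, state: map[string]string{}}
	fmt.Println(l.Update("k", "v"), l.Read("k")) // ok v
}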
3.2 Exploiting Nil-externality for Fast Replication
Using an off-the-shelf consensus protocol to build replicated storage leads to inefficiencies because this approach is oblivious to the properties of the storage interface. In particular, it is oblivious to nil-externality: all updates are immediately ordered and executed. Our hypothesis is that a replication protocol can deliver higher performance if it is cognizant of the underlying storage interface. Specifically, if a protocol is aware of nil-externality, it can delay ordering and execution, improving performance. We now provide an overview of such a protocol. We describe the detailed design soon (§4).
A nilext-aware protocol defers ordering and execution of operations until their effects are externalized. Figure 2 shows how such a protocol handles different operations. First, nilext writes are made durable immediately, but their ordering and execution are deferred. Clients send nilext writes to all replicas. Clients wait for enough replies including one from the leader before they consider the request to be completed. Nilext writes thus complete in one RTT. At this point, the operation is durable and considered complete; clients can make progress without waiting for the operation to be ordered and executed. We say that an operation is finalized when it is assigned an index and applied to the storage system.
State modified by nilext updates can be externalized later by other non-nilext operations (e.g., reads). Therefore, the protocol must ensure that replicas apply the updates in the same order and it has to do so before the modifications are externalized. Thus, upon receiving a read, the leader checks if there are any unfinalized updates that this read depends upon. If no, it quickly serves the read. Conversely, if there are unfinalized updates, the leader synchronously establishes the order and waits for enough followers to accept the order; the leader then applies the pending updates and serves the read. In practice, most reads can be served without triggering synchronous ordering and execution because the leader keeps finalizing updates in the background; thus, in most cases, updates are finalized already by the time a read arrives.
Finally, the protocol does not defer ordering and executing non-nilext updates. Clients submit non-nilext requests to the leader which finalizes the request by synchronously ordering and executing it (and the previously completed requests).
A nilext-aware protocol can complete nilext updates in one RTT; non-nilext updates take two RTTs. A read can be served in one RTT if prior nilext updates that the read depends upon are applied before the read arrives. Thus, exploiting nil-externality offers benefit if a significant fraction of updates is nilext and reads do not immediately follow them. We next show that these conditions are prevalent in practice.
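A toy Go sketch of this read-path behavior (ours; the names pending and finalize are illustrative, not SKYROS's code) captures the idea: the leader serves a read immediately unless there is an unfinalized update to the item, in which case it synchronously orders and applies pending updates first.

package main

import "fmt"

// A toy model of a nilext-aware leader.
type leader struct {
	state   map[string]string            // applied (finalized) state
	pending []struct{ key, val string }  // durable but not yet ordered/applied nilext writes
}

// NilextPut only records the write durably; ordering and execution are deferred.
func (l *leader) NilextPut(key, val string) {
	l.pending = append(l.pending, struct{ key, val string }{key, val})
}

// finalize orders and applies all pending updates (normally done in the
// background; shown synchronously here for brevity).
func (l *leader) finalize() {
	for _, w := range l.pending {
		l.state[w.key] = w.val
	}
	l.pending = nil
}

// Read checks for pending updates to the key before serving it.
func (l *leader) Read(key string) string {
	for _, w := range l.pending {
		if w.key == key { // unfinalized update: must order and execute first
			l.finalize()
			break
		}
	}
	return l.state[key]
}

func main() {
	l := &leader{state: map[string]string{}}
	l.NilextPut("k", "v") // completes without ordering
	fmt.Println(l.Read("k"))
}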
3.3 Fast Case is the Common Case
We first analyze the prevalence of nilext updates. First, we note that in some systems, almost all updates are nilext (e.g., write-optimized key-value stores as shown in Table 1). Some systems like Memcached have many non-nilext interfaces. However, how frequently do applications use them? To answer this question, we examine production traces [75, 79] from Twemcache, a Memcached clone at Twitter [74]. The traces contain ~200 billion requests across 54 clusters. Twemcache supports 9 types of updates (similar to Memcached as shown in Table 1). Except for set, others are non-nilext.
We consider 29 clusters that have at least 10% updates. Figure 3(a) shows the distribution of nilext percentages. In Twemcache, in 80% of the clusters, more than 90% of updates are nilext (set). This aligns with Memcached's expected usage [1] that most updates are sets and others are only sparingly used. Also, among the eight non-nilext updates, applications used only five: add, cas, delete, incr, and prepend.

Among these, only incr and cas return an execution result, while others return execution errors; perhaps changing the interface (to not return errors) can enable a replication protocol to realize higher performance.

Figure 3. Fast Case is Common. (a) shows the distribution of nilext percentages; a bar for a range x%-y% shows the percentage of clusters where x%-y% of updates are nilext. (b) shows the distribution of the percentage of reads within Tf; a bar for x%-y% shows the percentage of clusters where x%-y% of reads access objects updated within Tf. We consider Tf = 1s and 50ms.

We performed a similar analysis on the IBM-COS traces across 35 storage clusters with at least 10% writes (out of 98 in total) [24]. COS supports three kinds of updates: put, copy, and delete. While put and copy are nilext, delete is not; it returns an error if the object does not exist. In about 65% of the clusters, more than half of the updates are nilext; these operations can be completed quickly. Again, if the semantics of delete can be modified, all updates can be made faster.
We next analyze how often reads may incur overhead. A read will incur overhead if there are unfinalized updates to the object being read. Let Tf be the time taken to finalize updates. We thus measure the time interval between a read to an object and the prior write to the same object, and calculate the percentage of reads for which this interval is less than Tf. We use the IBM-COS traces for this analysis because the Twemcache traces do not have millisecond-level timestamps.
Figure 3(b) shows the distribution of the percentage of reads that access items updated within Tf. We first consider Tf to be 1s. Even with such an unrealistically high Tf, in 66% of clusters, less than 5% of reads access objects modified within 1s. We next consider a more realistic Tf of 50ms. Tf = 50ms is realistic (but still conservative) because these traces are from a setting where replicas are in different zones of the same geographical region, and inter-zone latencies are ~2 ms [38]. With Tf = 50ms, in 85% of clusters, less than 5% of reads access objects modified within 50ms; thus, only a small fraction of reads in a nilext-aware protocol may incur overhead in practice. Further, not all such reads will incur overhead due to prior reads to unfinalized updates and non-nilext updates that would force synchronous ordering.
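The measurement itself is straightforward to express; the Go sketch below (ours, assuming a simple in-memory trace-record format rather than the actual IBM-COS trace schema) computes, for a given Tf, the fraction of reads that access an object written within Tf before the read — the reads that could trigger synchronous ordering.

package main

import (
	"fmt"
	"time"
)

// A trace record: an operation on an object at a point in time (assumed format).
type record struct {
	ts    time.Time
	op    string // "read" or "write"
	objID string
}

// fractionOfReadsWithin returns the fraction of reads that access an object
// written within tf before the read.
func fractionOfReadsWithin(trace []record, tf time.Duration) float64 {
	lastWrite := map[string]time.Time{}
	reads, recent := 0, 0
	for _, r := range trace {
		switch r.op {
		case "write":
			lastWrite[r.objID] = r.ts
		case "read":
			reads++
			if w, ok := lastWrite[r.objID]; ok && r.ts.Sub(w) < tf {
				recent++
			}
		}
	}
	if reads == 0 {
		return 0
	}
	return float64(recent) / float64(reads)
}

func main() {
	t0 := time.Now()
	trace := []record{
		{t0, "write", "a"},
		{t0.Add(10 * time.Millisecond), "read", "a"}, // within 50ms of the write
		{t0.Add(2 * time.Second), "read", "a"},       // not within 50ms
	}
	fmt.Println(fractionOfReadsWithin(trace, 50*time.Millisecond)) // 0.5
}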
3.4 Comparison to Other Approaches
While nilext-aware replication defers ordering, prior work has built solutions to efficient ordering. The nilext-aware approach offers advantages over such prior solutions. While we focus on consensus-based approaches here, other ways to construct replicated storage systems exist; we discuss how exploiting nil-externality applies to them as well.

3.4.1 Efficient Ordering in Consensus. Prior approaches to efficient ordering broadly fall into three categories.
Network Ordering. This approach enforces ordering in the network [21, 50]: the network consistently orders requests across replicas in one RTT, improving performance. In contrast, a nilext-aware protocol does not require a specialized network and thus applies to geo-replication as well.
Speculative Execution. This approach employs speculative execution to reduce ordering cost [42, 67]. Replicas speculatively execute requests before agreeing on the order. Clients then compare responses from different replicas to detect inconsistencies and replicas roll back their state upon divergence. Replicas can thus be in an inconsistent state before the end application is acknowledged. However, when the end application is notified, the system ensures that the requests have been executed in the correct order. In contrast, the nature of nilext interfaces allows one to defer ordering and execution even after the application is notified of completion; only durability must be ensured before notifying. Ordering and execution are performed only when the effects are externalized by later operations. Also, a nilext-aware protocol does not require replicas to do rollbacks, reducing complexity.
Exploiting Commutativity. This approach (used in Generalized Paxos [45], EPaxos [58]) realizes that ordering is not needed when updates commute. Both commutative and nilext-aware protocols incur overhead when reads access unfinalized updates. However, as we show (§5.7), commutative protocols can be expensive when updates conflict and when operations do not commute. Nilext-aware replication, in contrast, always completes nilext updates in one RTT. Finally, nil-externality and commutativity are not at odds: a nilext-aware protocol can exploit commutativity to commit non-nilext writes faster (§5.7).

3.4.2 Other Approaches to Replicated Storage. Shared registers [4], primary-backup [9], and chain replication [76] offer other ways to build replicated storage. Storage systems that support only reads and writes can be built using registers which are not subject to FLP impossibility [4]. However, shared registers cannot readily enable RMWs [2, 10], a common requirement in modern storage APIs. Starting with state machines as the base offers more flexibility and exploiting nil-externality when possible leads to high performance. Gryff [10] combines registers (for reads and writes) and consensus (for RMWs); however, Gryff's writes take 2 RTTs. Primary-backup, chain replication, and other approaches [19] support a richer API. However, primary-backup also incurs 2 RTTs for updates [51, 64]; similarly, updates in chain replication also incur many message delays. The idea of exploiting nil-externality can be used to hide the ordering cost in these approaches as well; we leave this extension as an avenue for future work.

Summary. Unlike existing approaches, nilext-aware replication takes advantage of nil-externality of storage interfaces. It should perform well in practice: nilext updates contribute to a large fraction of writes and reads do not often access recent updates. This approach offers advantages over existing efficient ordering mechanisms: it requires no network support; it can defer execution beyond request completion and does not require rollbacks; it offers advantages over and combines well with exploiting commutativity.

Figure 4. Client Interface and Upcalls. The figure shows the client interface and the upcalls the replication layer makes into the storage system. Client interface: InvokeNilext(req) is sent to all replicas, and the client waits for acks from a supermajority, including one from the leader; InvokeNonNilext(req) and InvokeRead(req) are sent only to the leader, and the client waits for the result from the leader. Upcalls: MakeDurable adds a nilext update to the durability log; Read reads an item and returns <need_sync, data>; Apply applies a request to state and optionally returns a result; GetDurabilityLogEntries is used in background ordering and view change.
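The split in Figure 4 can be written down as two small interfaces. The Go sketch below is our rendering of that figure; the method names follow the figure, while the signatures are assumptions rather than SKYROS's actual code.

package skyros

// Client lists how applications invoke operations (Figure 4, left side).
// Nilext updates go to all replicas; reads and non-nilext updates go only to
// the leader.
type Client interface {
	// InvokeNilext completes after a supermajority of acks, including one
	// from the leader.
	InvokeNilext(req []byte) error
	// InvokeNonNilext and InvokeRead wait for the result from the leader.
	InvokeNonNilext(req []byte) ([]byte, error)
	InvokeRead(req []byte) ([]byte, error)
}

// Storage lists the upcalls the replication layer makes into the storage
// system (Figure 4, right side).
type Storage interface {
	// MakeDurable adds a nilext update to the durability log.
	MakeDurable(req []byte) error
	// Read returns the item and whether unfinalized updates to it exist
	// (needSync); if so, the leader must order and execute them first.
	Read(req []byte) (data []byte, needSync bool, err error)
	// Apply applies a request to state and optionally returns a result.
	Apply(req []byte) ([]byte, error)
	// GetDurabilityLogEntries is used in background ordering and view changes.
	GetDurabilityLogEntries() [][]byte
}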
4 SKYROS Design and Implementation
We now describe the design of SKYROS. We first provide an overview (§4.1), describe normal operation (§4.2–§4.5), and explain recovery and view change (§4.6). We then show the correctness of SKYROS (§4.7). We finally discuss practical issues we addressed in SKYROS (§4.8).

4.1 Overview
We use VR (or multi-paxos) as our baseline to highlight the differences in SKYROS. VR tolerates up to f failures in a system with 2f + 1 replicas. It is leader-based and makes progress in a sequence of views; in each view, a single replica serves as the leader. VR implementations offer linearizability [36]: operations are executed in real-time order, and each operation sees the effect of ones that completed before it. SKYROS preserves all these properties: it provides the same availability, is leader-based, and offers linearizability.
In VR, the leader establishes an order by sending a prepare and waiting for prepare-ok from f followers. The leader then does an Apply upcall into the storage system to execute the operation. SKYROS changes this step in an important way: while SKYROS makes updates immediately durable, it defers ordering and executing them until their effects are externalized. To enable this, SKYROS augments the interface between the storage system and the replication layer with additional upcalls (as shown in Figure 4). During normal operation, SKYROS processes different requests as follows:
• Clients submit nilext updates to all replicas using InvokeNilext. Since nil-externality is a static property (it does not depend upon the system state), clients can decide which requests are nilext and invoke the appropriate call. Upon receiving a nilext update, replicas invoke the MakeDurable upcall to make the operation durable (§4.2).
• Although nilext updates are not immediately finalized, they must be executed in the same real-time order across replicas. The leader gets the replicas to agree upon an order and the replicas apply the updates in the background (§4.3).
• Clients send read requests to the leader via InvokeRead. When a read arrives, the leader does a Read upcall. If all updates that the read depends upon are already applied, the read is served quickly; otherwise, the leader orders and executes updates before serving the read (§4.4).
• Clients send non-nilext updates to the leader via InvokeNonNilext; such updates are immediately finalized (§4.5).
4.2 Nilext Updates
Clients send nilext updates directly to all replicas including the leader to complete them in one RTT. Each request is uniquely identified by a sequence number, a combination of client-id and request number. Similar to VR, only replicas in the normal state reply to requests and duplicate requests are filtered. A replica stores the update by invoking MakeDurable. SKYROS replicas store these durable (but not yet ordered or applied) updates in a separate durability log; each replica thus has two logs: the usual consensus log and the durability log. Once a replica stores the update in the durability log, it responds directly to the client; the replica adds its current view number in the response. For a nilext update, clients wait for a supermajority of f + ⌈f/2⌉ + 1 acknowledgments in the same view including one from the leader of the view. Figure 5(a)(i) shows how a nilext update a is completed.
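A minimal Go sketch of the replica-side handling just described (ours; the actual implementation additionally handles concurrency, retransmission, and much more): the replica filters duplicates, appends the update to its durability log via the MakeDurable path, and replies directly to the client with its current view.

package main

import "fmt"

type nilextRequest struct {
	ClientID uint64
	ReqNum   uint64 // (ClientID, ReqNum) uniquely identifies the request
	Op       []byte
}

type reply struct {
	View       uint64
	FromLeader bool
}

type replica struct {
	view          uint64
	isLeader      bool
	normal        bool               // only replicas in the normal state reply
	durabilityLog []nilextRequest    // durable but not yet ordered or applied
	seen          map[[2]uint64]bool // duplicate filter
}

// handleNilext makes the update durable and replies directly to the client.
func (r *replica) handleNilext(req nilextRequest) (reply, bool) {
	if !r.normal {
		return reply{}, false
	}
	id := [2]uint64{req.ClientID, req.ReqNum}
	if !r.seen[id] {
		r.seen[id] = true
		r.durabilityLog = append(r.durabilityLog, req) // MakeDurable
	}
	return reply{View: r.view, FromLeader: r.isLeader}, true
}

func main() {
	r := &replica{view: 1, isLeader: true, normal: true, seen: map[[2]uint64]bool{}}
	rep, ok := r.handleNilext(nilextRequest{ClientID: 7, ReqNum: 1, Op: []byte("put k v")})
	fmt.Println(ok, rep.View, rep.FromLeader, len(r.durabilityLog))
}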
Note that an update need not be added in the same position in the durability logs across replicas. For example, in Figure 5(b)(i), b is considered completed although its position is different across durability logs. Then, why do SKYROS replicas use a durability log instead of a set? Using an unordered set precludes the system from reconstructing the required ordering between updates upon failures. For example, in Figure 5(b)(i) and (b)(ii), b follows a in real time (i.e., a completed before b started) and thus must be applied to the storage system only after a. A log captures the order in which the replicas receive the requests; SKYROS uses these logs to determine the ordering of requests upon failures.
Why is a simple majority (f + 1) insufficient? Consider an update b that follows another update a in real-time. Let's suppose for a moment that we use a simple majority. A possible state then is <D1: ab, D2: ab, D3: ab, D4: ba, D5: ba>, where Di is the durability log of replica Si. This state is possible because a client could consider a to be completed once it receives acknowledgment from S1, S2, and S3. Then, b starts and is stored on all durability logs and so is considered completed. a now arrives late at S4 and S5. Assume the current leader (S1) crashes. Now, we have four replicas whose logs are <D2: ab, D3: ab, D4: ba, D5: ba>. With these logs, one cannot determine the correct order. A supermajority quorum avoids this situation. Writing to a supermajority ensures that a majority within any available majority is guaranteed to have the requests in the correct order in their durability logs. We later show how by writing to a supermajority, SKYROS recovers the correct ordering upon failures (§4.6, §4.7).
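On the client side, completion of a nilext write reduces to a quorum check over the replies. The helper below (our sketch, with assumed names) encodes the rule stated above: at least f + ⌈f/2⌉ + 1 acks, all from the same view, including one from that view's leader. For f = 2 (five replicas), the supermajority is four.

package main

import "fmt"

type reply struct {
	View       uint64
	FromLeader bool
}

// supermajority returns f + ceil(f/2) + 1 for a system that tolerates
// f failures (2f + 1 replicas).
func supermajority(f int) int {
	return f + (f+1)/2 + 1
}

// nilextComplete reports whether a nilext write can be considered complete:
// enough acks in the same view, one of which is from the leader of that view.
func nilextComplete(replies []reply, view uint64, f int) bool {
	count, leaderAcked := 0, false
	for _, r := range replies {
		if r.View != view {
			continue
		}
		count++
		if r.FromLeader {
			leaderAcked = true
		}
	}
	return leaderAcked && count >= supermajority(f)
}

func main() {
	// f = 2: a supermajority of 4 out of 5 replicas, including the leader.
	replies := []reply{{1, true}, {1, false}, {1, false}, {1, false}}
	fmt.Println(supermajority(2), nilextComplete(replies, 1, 2)) // 4 true
}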

(ii) read(a) (iii) b,c acked (iv) updates ordered


(i) b follows a a,b concurrent check is system-specific, which led to our design rationale
updates already read(c) and applied;
a complete
applied: serve a order and apply now serve c
d-log c-log d-log c-log d-log c-log d-log c-log
S1L a b a b b a a b of maintaining the durability log within the storage system,
S1L a a b c a a b c S2 a b a b a b a b giving it visibility in to the pending updates to perform the
S2 a a b c a b c a b c S3 a b a b a b a b check. The storage system maintains an efficient index (such
S3 a a b c a b c a b c S4 b a b a b b a as a hash table) to quickly lookup the log.
S4 a a b c a b c a S5 a b a a b b a If there are no pending updates, the storage system popu-
S5 (i) (ii) (iii) (iv)
time lates the response by reading the state, sets the need_sync bit
(a) Skyros writes and reads (b) durability log states
to 0, and returns the read value to the replication layer. The
Figure 5. S KYROS Writes and Reads, and Durability Log leader then returns the response to the client, completing
States. (a) shows how Skyros processes nilext writes and reads; d-log: dura-
bility, c-log: consensus log, L: leader; f=2 and supermajority=4. (b) shows the
the read in one RTT (e.g., read-a in Figure 5(a)(ii)).
possible durability logs for two completed nilext operations 𝑎 and 𝑏. In (i) and Conversely, if there are pending updates, the storage sys-
(ii), 𝑏 follows 𝑎 in real time, whereas in (iii) and (iv), they are concurrent. tem sets the need_sync bit. In that case, the leader synchronously
adds all requests from the durability log to the consensus
During normal operation, the leader’s durability log is log to order and execute them (e.g., read-c in Figure 5(a)(iii)).
guaranteed to have the updates in the correct order. This is Once 𝑓 followers respond, the leader applies all the updates
because a response from the leader is necessary for a request and then serves the read. Fortunately, the periodic back-
to complete. Thus, if an update 𝑏 follows another update 𝑎 ground finalization reduces the number of requests that must
in real-time, then the leader’s durability log is guaranteed to be synchronously ordered and executed during such reads.
have 𝑎 before 𝑏 (while some replicas may contain them in a
different order as in Figure 5(b)(ii)). This guarantee ensures 4.5 Non-nilext Updates
that when clients read from the leader, they see the writes If an update externalizes state, then it must be immediately
in the correct order. The leader uses this property to ensure ordered and executed. Clients send such non-nilext updates
that operations are finalized to the consensus log in the only to the leader. The leader first adds all prior requests in
correct order. If 𝑎 and 𝑏 are concurrent, they can appear in the durability log to the consensus log; it then adds the non-
the leader’s log in any order as in Figure 5(b)(iii) and (b)(iv). nilext update to the end of the consensus log and then sends a
prepare for all the added requests. Once 𝑓 followers respond,
4.3 Background Ordering and Execution the leader applies the non-nilext update (after applying all
While nilext updates are not immediately ordered, they must
be ultimately executed in the same real-time order across
4.6 Replica Recovery and View Changes
replicas. The leader is guaranteed to have all completed up-
dates in its durability log in real-time order. Periodically, So far, we have described only the failure-free operation. We
the leader takes an update from its durability log (via the now discuss how S KYROS handles failures.
GetDurabilityLogEntries upcall), adds it to the consensus log, Replica Recovery. Similar to VR, S KYROS does not write
and initiates the usual ordering protocol. Once 𝑓 followers log entries synchronously to disk (although it maintains
respond after adding the request to their consensus logs, the view information on disk). Thus, when a replica recovers
leader applies the update and removes it from its durability from a crash, it needs to recover its log. In VR, the replica
log. At this point, the request is finalized. As in VR, the leader marks its status as recovering, sends a Recovery message,
sends a commit for the finalized request; the followers apply and waits for a RecoveryResponse from at least 𝑓 + 1 replicas,
the update and then remove it from their durability logs. including one from the leader of the latest view it sees in
Note that this step is the same as in VR; once 𝑓 + 1 nodes these responses [52]. Then, it sets its log as the one in the
agree on the order, at least one node in any majority will leader’s response. The replica then sets its status to normal.
have requests in the correct order in its consensus log. Recovery in S KYROS is very similar with one change: the
The leader employs batching for the background work; replicas also send their durability logs in RecoveryResponse
it adds many requests to its consensus log and sends one and a replica sets its durability log as the one sent by the
prepare for the batch. Once 𝑓 followers respond, it applies leader. This step is safe because the leader’s durability log
the batch and removes it from the durability log.
View Changes. In VR, when the leader of the current view
4.4 Reads fails, the replicas change their status from normal to view-
Clients read only at the leader in S KYROS (like in many change and run a view-change protocol. The new leader must
linearizable systems). When a read arrives, the leader does a recover all the committed operations in the consensus log
Read upcall. The storage system then performs an ordering before the system can accept requests. The new leader does
and execution check: it consults the durability log to check if this by waiting for 𝑓 other replicas to send a DoViewChange
there are any pending updates that this read depends upon. message [52]. In this message, a replica includes its view
For example, a key-value store would check if there is a number, its log, and the last view number in which its status
pending put or merge to the key being read. Note that this was normal. The leader then recovers the log by taking the

1: procedure RecoverDurabilityLog
2:    D ← durability logs in the highest normal view
3:    E ← entries that appear in at least ⌈f/2⌉ + 1 logs in D
4:    for v ∈ E do
5:        add v as a vertex in graph G
6:    for every pair (a, b) in E do
7:        n1 ← number of logs in D where a appears before b
8:        n2 ← number of logs in D where a is present but not b
9:        if n1 + n2 ⩾ ⌈f/2⌉ + 1 then
10:           add an edge from a to b in G
11:   NLD ← TopologicalSort(G)    ⊲ NLD is the new leader's durability log

Figure 6. RecoverDurabilityLog. The figure shows the procedure to recover the durability log at the leader during a view change.

Figure 7. RecoverDurabilityLog Example. The figure shows how RecoverDurabilityLog works. S1, the leader of the previous view (view-1), has failed; this is a view change for view-2, for which S2 is the leader. Ground truth: a, b, and c are complete but d is incomplete; a and b are concurrent, and c follows a and b. In case (i), replicas S2, S3, and S4 participate: E = {a, b, c, d} and the NLD is bacd. In case (ii), replicas S2, S4, and S5 participate: E = {a, b, c} and the possible NLDs are bac and abc (both are valid).

and c; also, c must appear after a and b in the recovered log. The system must make progress with f failures; thus, the procedure must correctly recover the durability log with
𝑓 + 1 replicas participating in a view change. As in VR, upon
most up-to-date† one among the 𝑓 +1 logs (including its own). receiving 𝑓 +1 DoViewChange messages, the leader first finds
The leader then sends a StartView message to the replicas in the highest normal view from the responses and considers
which it includes its log; the leader sets its status as normal. all durability logs in that view; we denote this set of logs as 𝐷
The replicas set their consensus log as the one sent by the (line 2). For example, in Figure 7(i), 𝑆 2 , 𝑆 3 , and 𝑆 4 participate
leader after which they set their status as normal. S KYROS in the view change and the last normal view of all replicas is 1.
uses exactly the same procedure to recover operations that Therefore, 𝐷 2 , 𝐷 3 , and 𝐷 4 are part of 𝐷. To recover completed
have been finalized (i.e., operations in the consensus log). operations, the leader then checks which operations appear
Thus, finalized operations are safely recovered as in VR. in at least ⌈𝑓 /2⌉ + 1 logs in 𝐷. Such operations are the ones
In S KYROS, the leader must additionally recover the dura- that the leader will recover as part of the new durability log;
bility log. The previous leader’s durability log would have we denote this set as 𝐸 (line 3). For example, in Figure 7(i),
contained all completed operations. Further, the previous 𝑎, 𝑏, 𝑐, and 𝑑 are part of 𝐸 (as they all appear in ⩾ 2 logs);
leader’s durability log would have contained the completed similarly, in Figure 7(ii), 𝑎, 𝑏, and 𝑐 are part of 𝐸.
operations in the correct real-time order, i.e., if an operation The above steps give the operations that form the dura-
𝑎 had completed before 𝑏, then 𝑎 would have appeared be- bility log, but not the real-time order among them. To deter-
fore 𝑏. These same guarantees must be preserved in the new mine the order, the leader considers every pair of operations
leader’s durability log during a view change. < 𝑥, 𝑦 > in 𝐸, and counts the number of logs where 𝑥 appears
S KYROS replicas send their durability logs as well in the before 𝑦 or 𝑥 appears but 𝑦 does not. If this count is at least
DoViewChange message. However, it is unsafe for the new ⌈𝑓 /2⌉ + 1, then the leader determines that 𝑦 follows 𝑥 in real
leader to take one log in the responses as its durability log; time. In Figure 7(ii), 𝑎 appears before 𝑐 on ⩾ 2 logs and so
a single log may not contain all completed operations. Con- the leader determines that 𝑐 follows 𝑎. In contrast, 𝑎 does
sider three completed updates 𝑎, 𝑏, and 𝑐, and let the dura- not appear before 𝑏 (or vice versa) in ⩾ 2 logs and thus are
bility logs be < 𝐷 1 : 𝑎𝑏𝑐, 𝐷 2 : 𝑎𝑐, 𝐷 3 : 𝑎𝑏𝑐, 𝐷 4 : 𝑎𝑏, 𝐷 5 : 𝑏𝑐 >. concurrent. Thus, this step gives only a partial order.
If 𝑆 2 , 𝑆 4 , and 𝑆 5 participate in a view change, no single log The leader constructs the total order as follows. It first
would contain all completed operations. Even if a single dura- adds all operations in 𝐸 as vertices in a graph, 𝐺 (lines 4–5).
bility log has all completed operations, it may not contain Then, for every pair of vertices < 𝑎, 𝑏 > in 𝐺, an edge is added
them in the correct real-time order. Consider 𝑎 completes between 𝑎 and 𝑏 if on at least ⌈𝑓 /2⌉ + 1 logs, either 𝑎 appears
before 𝑏 starts, and 𝑐 is incomplete and let the durability logs before 𝑏, or 𝑎 is present but not 𝑏 (lines 6–10). 𝐺 is a DAG
be < 𝐷 1 : 𝑎𝑏, 𝐷 2 : 𝑎𝑏, 𝐷 3 : 𝑏𝑎𝑐, 𝐷 4 : 𝑎𝑏, 𝐷 5 : 𝑎𝑏 >. If 𝑆 2 , 𝑆 3 , whose edges capture the real-time order between operations.
and 𝑆 4 participate in a view change, although 𝐷 3 contains To arrive at the total order, the leader topologically sorts 𝐺
all completed operations, taking 𝐷 3 as the leader’s log will (line 11) and uses the result as its durability log (𝑁 𝐿𝐷). In
violate linearizability because 𝑏 appears before 𝑎 in 𝐷 3 . Figure 7(ii), both 𝑏𝑎𝑐 and 𝑎𝑏𝑐 are valid total orders.
To correctly recover the durability log, a S KYROS leader The leader then appends the operations from the durabil-
uses the RecoverDurabilityLog procedure (Figure 6). We use ity log to the consensus log; duplicate operations are filtered
Figure 7 to illustrate how this procedure works. In this exam- using sequence numbers. Then, the leader sets its status
ple, 𝑓 =2; operations 𝑎, 𝑏, and 𝑐 completed, while 𝑑 did not. 𝑎 as normal. The leader then sends the consensus log in the
and 𝑏 were concurrent with each other, and 𝑐 started after StartView message to the replicas (similar to VR). The fol-
𝑎 and 𝑏 completed. Thus, the new leader must recover 𝑎, 𝑏, lowers, on receiving StartView, replace their consensus logs
with the one sent by the leader and set their status to normal.
† i.e., the log from a replica with the largest normal view; if many replicas have the same normal view, the largest log among them is chosen.
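Before turning to correctness, the following Go sketch is our transliteration of the RecoverDurabilityLog procedure in Figure 6; it assumes each durability log is a slice of operation identifiers, and it breaks ties in the topological sort arbitrarily, which is safe because such operations are concurrent. The example in main is our own and does not reproduce the exact logs of Figure 7.

package main

import "fmt"

// recoverDurabilityLog reconstructs the new leader's durability log from the
// durability logs of the f+1 replicas participating in a view change
// (the logs from the highest normal view), following Figure 6.
func recoverDurabilityLog(logs [][]string, f int) []string {
	threshold := (f+1)/2 + 1 // ceil(f/2) + 1

	// E: operations that appear in at least threshold logs (completed ops).
	count := map[string]int{}
	for _, l := range logs {
		for _, op := range l {
			count[op]++
		}
	}
	var E []string
	for op, c := range count {
		if c >= threshold {
			E = append(E, op)
		}
	}

	pos := func(l []string, op string) int { // index of op in l, or -1
		for i, x := range l {
			if x == op {
				return i
			}
		}
		return -1
	}

	// Add an edge a->b if, in at least threshold logs, a appears before b
	// or a is present but b is not (i.e., b follows a in real time).
	edges := map[string][]string{}
	indeg := map[string]int{}
	for _, a := range E {
		for _, b := range E {
			if a == b {
				continue
			}
			n := 0
			for _, l := range logs {
				pa, pb := pos(l, a), pos(l, b)
				if pa >= 0 && (pb < 0 || pa < pb) {
					n++
				}
			}
			if n >= threshold {
				edges[a] = append(edges[a], b)
				indeg[b]++
			}
		}
	}

	// A topological sort of the DAG gives the recovered durability log (NLD).
	var nld, ready []string
	for _, v := range E {
		if indeg[v] == 0 {
			ready = append(ready, v)
		}
	}
	for len(ready) > 0 {
		v := ready[0]
		ready = ready[1:]
		nld = append(nld, v)
		for _, w := range edges[v] {
			if indeg[w]--; indeg[w] == 0 {
				ready = append(ready, w)
			}
		}
	}
	return nld
}

func main() {
	// A hypothetical view change with f = 2 and three participating logs;
	// d appears in only one log, so it is not recovered.
	logs := [][]string{{"b", "a", "c", "d"}, {"a", "b", "c"}, {"b", "a", "c"}}
	fmt.Println(recoverDurabilityLog(logs, 2)) // [b a c]
}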

4.7 Correctness finalized operations (in the consensus log) survives across
We now show that S KYROS is correct. Two correctness condi- views; any operation committed to the consensus log will
tions must be met. C1: all completed and finalized operations survive in the same position.
remain durable, C2: all operations are applied in the lineariz- Next, we show that the linearizable order of completed-
able order and an operation finalized to a position survives but-not-finalized operations is preserved. As before, we need
in the same position. The proof sketch is as follows. to consider only operations that were completed but not yet
C1. Ensuring durability when the leader is alive is straight- finalized in 𝑣 ′; remaining operations will be recovered as part
forward; a failed replica can recover its state from the leader. of the consensus log. We now show that for any two com-
Durability must also be ensured during view changes; the pleted operations 𝑥 and 𝑦, if 𝑦 follows 𝑥 in real time, then 𝑥
new leader must recover all finalized and completed opera- will appear before 𝑦 in the new leader’s recovered durability
tions. Finalized operations are part of at least 𝑓 + 1 consensus log. Let 𝐺 be a graph containing all completed operations as
logs. Thus, at least one among the 𝑓 + 1 replicas participat- its vertices. Assume that for any pair of operations < 𝑥, 𝑦 >,
ing in the view change is guaranteed to have the finalized a directed edge from 𝑥 to 𝑦 is correctly added to 𝐺 if 𝑦 fol-
operations and thus will be recovered (this is similar to VR). lows 𝑥 in real time (A1). Next assume that 𝐺 is acyclic (A2).
Next we show that completed operations that have not If A1 and A2 hold, then a topological sort of 𝐺 ensures that
been finalized are recovered. Let 𝑣 be the view for which a 𝑥 appears before 𝑦 in the result of the topological sort. We
view change is happening and the highest normal view be show that A1 and A2 are ensured by S KYROS.
𝑣 ′. We first establish that any operation that completed in 𝑣 ′ A1: Consider two completed operations 𝑎 and 𝑏 and that
will be recovered in 𝑣. Operations are written to 𝑓 + ⌈𝑓 /2⌉ + 1 𝑏 follows 𝑎 in real time. Since 𝑎 completed before 𝑏, when
durability logs before they are considered completed and 𝑏 starts, 𝑎 must have already been present on at least 𝑓 +
are not removed from the durability logs before they are ⌈𝑓 /2⌉ + 1 durability logs; let this set of logs be 𝐷𝐿. Now, for
finalized. Therefore, among the 𝑓 + 1 replicas participating each log 𝑑𝑙 in 𝐷𝐿, if 𝑏 is written to 𝑑𝑙, then 𝑏 would appear
in the view change for 𝑣, a completed operation in 𝑣 ′ will after 𝑎 in 𝑑𝑙. If 𝑏 is not written to 𝑑𝑙, then 𝑎 would appear
be present in at least ⌈𝑓 /2⌉ + 1 durability logs. Because the in 𝑑𝑙 but not 𝑏. Thus, 𝑎 appears before 𝑏 or 𝑎 is present but
new leader checks which operations are present in at least not 𝑏 on at least 𝑓 + ⌈𝑓 /2⌉ + 1 durability logs. Consequently,
⌈𝑓 /2⌉ + 1 logs (line 2 in Figure 6), operations completed in among the 𝑓 + 1 replicas participating in view change, on at
𝑣 ′ that are not finalized will be recovered as part of the new least ⌈𝑓 /2⌉ + 1 logs, 𝑎 appears before 𝑏 or 𝑎 is present but
leader’s durability log. not 𝑏. Because the leader adds an edge from 𝑎 to 𝑏 when
We next show that operations that were completed in this condition is true (lines 7–9 in Figure 6) and because it
an earlier view 𝑣 ′′ will also survive into 𝑣. During the view considers all pairs, A1 is ensured. A2: Since ⌈𝑓 /2⌉ + 1 is a
change for 𝑣 ′, the leader of 𝑣 ′ would have recovered the op- majority of 𝑓 + 1, an opposite edge from 𝑏 to 𝑎 would not be
erations completed in 𝑣 ′′ as part of its durability log (by the added to 𝐺. Since all pairs are considered, 𝐺 is acyclic.
same argument above). Before the view change for 𝑣 ′ com- A completed operation is assigned a position only when
pleted, the leader of 𝑣 ′ would have added these operations it is finalized. Since S KYROS adds an operation from the
from its durability log to the consensus log. Any node in durability log to the consensus only if it is already not present
the normal status in view 𝑣 ′ would thus have these opera- in the consensus log, a completed operation is finalized only
tions in its consensus log. Consensus-log recovery would once, after which it survives in the finalized position.
ensure these operations remain durable in successive views Model Checking. We have modeled the request-processing
including 𝑣. and view-change protocols in S KYROS, and model checked
C2. During normal operation, the leader’s durability log them. We explored over 2M states, in which the above correct-
reflects the real-time order. The leader adds operations to its ness conditions were met. Upon modifying the specification
consensus log only in order from its durability log. Before in subtle but wrong ways, our model checker finds safety
an (non-nilext) operation is directly added to the consensus violations. For example, in the RecoverDurabilityLog proce-
log, all prior operations in the durability log are appended dure, an edge is added from 𝑎 to 𝑏 when 𝑎 appears before 𝑏 in
to the consensus log as well. Thus, all operations in the ⌈𝑓 /2⌉ + 1 logs; if this threshold is increased, then a required
consensus log reflect the linearizable order. Reads are served edge will not be added, leading to a linearizability violation
by the leader which is guaranteed to have all acknowledged that the checker correctly flags; decreasing the threshold
operations; thus, any read to an object will include the effect makes 𝐺 cyclic, triggering a violation. Similarly, the checker
of all previous operations. This is because the leader ensures finds a safety violation if durability-log entries are not added
that any pending updates that the read depends upon are to consensus log before sending StartView.
applied in a linearizable order before the read is served.
The correct order must also be maintained during view 4.8 Practical Issues and Solutions
changes. Similar to VR, the order established among the We now describe a few practical problems we handled in
S KYROS. We also discuss possible optimizations.

Space and Catastrophic Errors. Because nilext updates are not immediately executed, certain errors cannot be detected. For instance, an operation can complete but may fail later, when applied to the storage system, due to insufficient space. A protocol that immediately executes operations could, in theory, propagate such errors to clients. However, such space errors can be avoided in practice by using space watermarks that the replication layer has visibility into; once a threshold is hit, the replication layer can throttle updates while the storage system reclaims space. One cannot, however, anticipate catastrophic memory or disk failures. Fortunately, this is not a major concern in practice. Given the inherent redundancy, a SKYROS replica transforms such errors into a crash failure; it is unlikely that all replicas will encounter the same error. Note that these are errors that are not part of the nilext interface contract. SKYROS checks for all validation errors in the MakeDurable upcall.
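As an illustration of the checks discussed above, the following sketch shows a MakeDurable-style upcall that validates a request and enforces a space watermark before logging it; the class, fields, and thresholds are assumptions for illustration, not SKYROS's actual interface.

```python
class DurabilityError(Exception):
    pass

class StorageShim:
    """Hypothetical storage-side shim invoked by the replication layer."""
    def __init__(self, capacity_bytes, watermark=0.9):
        self.capacity = capacity_bytes
        self.watermark = watermark
        self.used = 0
        self.durability_log = []

    def make_durable(self, request):
        # Validation errors (e.g., malformed requests) are caught here,
        # before the nilext write is acknowledged.
        if not request.get("key"):
            raise DurabilityError("invalid request")
        size = len(request.get("value", b""))
        # Space watermark: once usage crosses the threshold, updates are
        # throttled while the storage system reclaims space.
        if self.used + size > self.watermark * self.capacity:
            raise DurabilityError("space watermark reached; throttling")
        self.durability_log.append(request)
        self.used += size
```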
Determining Nil-externality. While it is straightforward in many cases to determine whether or not an interface is nilext, occasionally it is not. For instance, a database update may invoke a trigger, which can externalize state. When unsure, however, clients can safely choose to say that an interface is non-nilext, forgoing some performance for safety.

Replica-group Configuration and Slow Path. In our implementation, clients know the addresses of replicas from a configuration value. During normal operation, SKYROS clients contact all replicas in the group and wait for supermajority responses to complete nilext writes. If the system is operating with a bare majority, then writes cannot succeed, affecting availability. SKYROS handles this situation using a slow path: after a handful of retries, clients mark requests as non-nilext and send them to the leader. These requests are acknowledged after they are committed to a majority of consensus logs, allowing clients to make progress.
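The sketch below illustrates this client-side fallback under assumed messaging primitives; it is not the actual SKYROS client code.

```python
import time

def submit_nilext(client, request, f, retries=3, timeout=0.05):
    """Client-side sketch: fast path to a supermajority, with a slow-path
    fallback through the leader. The client methods are assumptions."""
    supermajority = f + (f + 1) // 2 + 1        # f + ceil(f/2) + 1
    for _ in range(retries):
        acks = client.send_to_all(request, timeout=timeout)
        if len(acks) >= supermajority:
            return acks                          # completed in one RTT
        time.sleep(timeout)
    # Slow path: mark the request non-nilext and send it to the leader;
    # it is acknowledged once committed to a majority of consensus logs.
    request["force_ordering"] = True
    return client.send_to_leader(request)
```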
Possible Optimizations. In SKYROS, requests are initially stored in the durability log. The leader later adds the requests to its consensus log and replicates the consensus log. Our current implementation sends the requests in their entirety during background replication. This is unnecessary in most cases because the replicas already contain the requests in their durability logs; a more efficient way would be to send only the ordering information (i.e., the sequence numbers). Second, locally, a copy between the durability log and the consensus log can be avoided if the entries are stored in a separate location and the log slots point to the entries. Finally, SKYROS allows reads only at the leader; the burden on the leader can be alleviated by using techniques such as quorum reads [12] without impacting linearizability. We leave these optimizations as an avenue for future work.

5 Evaluation
To evaluate SKYROS, we ask the following questions:
• How does SKYROS perform compared to standard replication protocols on nilext-only workloads? (§5.1)
• How does SKYROS perform on mixed workloads? (§5.2)
• How do read-latest percentages affect performance? (§5.3)
• Does the supermajority requirement in SKYROS impact performance with many replicas? (§5.4)
• How does SKYROS perform on YCSB workloads? (§5.5)
• Does replicated RocksDB benefit from SKYROS? (§5.6)
• Does SKYROS offer benefit over commutative protocols? Is nil-externality compatible with commutativity? (§5.7)

Setup. We run our experiments on five replicas; thus, f=2 and supermajority=4. Each replica runs on an m5zn bare-metal instance [5] in AWS (US-East). Numbers reported are the average over three runs. Our baseline is VR/Multi-Paxos, which implements batching to improve throughput (denoted as Paxos). SKYROS also uses batching for background work. Most of our experiments use a hash-table-based key-value store; however, we also show cases with RocksDB.

5.1 Microbenchmark: Nilext-only Workload
We first compare the performance for a nilext-only workload. Figure 8(a) plots the average latency against the throughput when varying the number of clients. We also compare to a no-batch Paxos variant in this experiment. In all further experiments, we compare only against Paxos with batching.

We make three observations from the figure. First, SKYROS and Paxos offer ~3× higher throughput than the Paxos no-batch variant. Second, with a small number of clients, SKYROS offers ~2× better latency and throughput than Paxos with batching. Third, batching across many clients improves the throughput of Paxos, but this affects latency: at about 100 KOps/s, SKYROS offers 3.1× lower latency than Paxos.

5.2 Microbenchmark: Mixed Workloads
We next consider mixed workloads. We use 10 clients.

Nilext and non-nilext writes. Figure 8(b)(i) shows the result for a workload with a mix of nilext and non-nilext writes. With low non-nilext fractions, SKYROS offers 2× higher throughput because most writes complete in 1 RTT. As the non-nilext fraction increases, the benefits of SKYROS reduce. However, even in the worst case where all writes are non-nilext, SKYROS does not perform worse than Paxos. As noted earlier, in many deployments, the fraction of non-nilext writes is low and thus SKYROS would offer benefit; for example, with 10% non-nilext writes, SKYROS offers ~78% higher throughput.

Nilext and reads. We next consider a workload with nilext writes and reads. In SKYROS, if a read accesses a key for which there are unfinalized updates, the read will incur 2 RTTs. We thus consider two request distributions: uniform and zipfian. We vary the percentage of writes (W) and show the mean and p99 latency in Figure 8(b)(ii). In the uniform case, operations do not often access the same keys and thus reads rarely incur 2 RTTs.
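The 2-RTT read path described above can be pictured with the following leader-side sketch; the data structures and method names are assumptions for illustration only.

```python
def serve_read(leader, key):
    """Leader-side sketch: a read that depends on unfinalized updates
    first forces them through the ordering path, which is what makes
    such reads take 2 RTTs."""
    pending = [op for op in leader.durability_log
               if op["key"] == key and not op.get("finalized", False)]
    if pending:
        # Synchronous ordering: append the pending updates (and all prior
        # durability-log entries) to the consensus log and replicate them
        # before answering.
        leader.finalize_up_to(pending[-1])
    return leader.storage.get(key)
```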
Figure 8. Microbenchmark: Different Workload Mixes. (a) compares the performance of Skyros to Paxos for a nilext-only workload. (b) shows the performance under three different mixed workloads (nilext+nonnilext, nilext+reads, and nilext+nonnilext+reads).
Figure 9. Read-latest. The figure shows the performance of Skyros with varying read-latest percentages.

Figure 10. Latency with Many Replicas. The figure compares the average latency of Skyros for different cluster sizes.
With a high W (90%), SKYROS offers significant benefit: it reduces mean latency by 2.2× and p99 latency by 4.1×.

In the zipfian case, some keys are more popular than others. Therefore, reads may often access keys recently modified by writes. Thus, as shown, p99 latency in SKYROS for zipfian increases compared to the uniform case. However, not all reads incur 2 RTTs because of background finalization and prior reads that force synchronous ordering. Thus, although the improvements decrease compared to the uniform case, SKYROS still offers significant benefit over Paxos (e.g., at W = 90%, mean and p99 latencies in SKYROS are 2× lower).

Writes and reads. We next run a mixed workload with all three kinds of operations. We vary the write percentage (W) and fix the non-nilext fraction to be 10% of W. As shown in Figure 8(b)(iii), with a small fraction of writes, SKYROS offers little benefit over Paxos because reads take 1 RTT in both systems. With a higher W, SKYROS offers higher performance; for example, with W=90% (9% non-nilext), SKYROS offers 1.72× higher throughput.

5.3 Microbenchmark: Read Latest
If many reads access recently modified items, then SKYROS would incur overhead on reads. To show this, we run a workload with 50% nilext writes and 50% reads with 10 clients. We vary the amount of reads that access items that were updated within three different windows: [0-100] us (roughly 1 RTT on our testbed), [0-200] us (roughly 2 RTTs), and [0-1] ms (a large window), and measure the average request latency.

Figure 9 shows the result. Intuitively, if no or few reads access recently modified items, then the performance of SKYROS would not be affected by reads taking 2 RTTs (leftmost point of the graph). SKYROS offers ~70% lower latency than Paxos. As we increase the percentage of reads accessing items updated in the window, more reads incur 2 RTTs and thus the average latency increases. Moreover, latency increases more steeply for smaller windows; for example, when all reads go to items updated in the last 100 us, many reads (~68%) incur 2 RTTs. Again, not all reads incur 2 RTTs because of background finalization and prior reads to the items that force synchronous ordering. In common workloads, where reads do not often access recently written items, SKYROS offers advantages. For example, with 10% of reads accessing items updated in the last 100 us, SKYROS offers 70% lower latency.

5.4 Microbenchmark: Latency with Many Replicas
In prior experiments, we use five replicas and thus clients wait for four responses. With larger clusters, SKYROS clients must wait for more responses (e.g., seven responses with nine replicas), potentially increasing latency. To examine this, we conduct an experiment with seven and nine replicas and measure the latencies for a nilext-only workload with 10 clients. As shown in Figure 10, the additional responses do not add much to the latencies; latencies in the seven- and nine-node configurations are similar to that of the five-replica case (about 110𝜇s) and are about 2× lower than Paxos.
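For concreteness, the quorum sizes implied by the 𝑓 + ⌈𝑓/2⌉ + 1 durability quorum used above can be computed as follows (illustrative helper only):

```python
import math

def quorum_sizes(n):
    f = (n - 1) // 2                              # tolerated failures
    majority = n // 2 + 1
    supermajority = f + math.ceil(f / 2) + 1      # durability quorum
    return f, majority, supermajority

for n in (5, 7, 9):
    print(n, quorum_sizes(n))
# 5 replicas -> (2, 3, 4); 7 -> (3, 4, 6); 9 -> (4, 5, 7)
```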
Microbenchmark Summary. SKYROS offers benefit under many workloads with different request ratios and distributions. Even when pushed to extreme cases (e.g., all writes are non-nilext or all reads access recent writes), SKYROS does not perform worse than Paxos. Under realistic workloads, SKYROS offers higher throughput, and lower mean and tail latencies.

5.5 YCSB Macrobenchmark
We next analyze performance under six YCSB [16] workloads: Load (write-only), A (50% w, 50% r), B (5% w, 95% r), C (read-only), D (5% w, 95% r), and F (50% rmw, 50% r). Figure 11(a) shows the result for 10 clients. For write-heavy workloads (Load, A, and F), SKYROS improves throughput by 1.43× to 2.29×. SKYROS offers similar performance for the read-only workload. For read-heavy workloads (B and D), SKYROS offers little benefit; only 5% of operations can be made faster.

To understand the effect of reads that trigger synchronous ordering, we examine the read-latency distributions (Figure 11(b) and (d)).
Figure 11. YCSB Performance. (a) shows the throughput for all ycsb workloads; (b) and (d) show the read-latency distribution for ycsb-a and ycsb-b, respectively; (c) and (e) show the operation-latency distribution for the same workloads.
Figure 12. SKYROS Latency Benefits. The figure compares the average latency at maximum throughput for mixed YCSB workloads. The number below each bar shows the throughput for the workload.

Figure 13. RocksDB. The figure shows performance in RocksDB.

In both ycsb-a and ycsb-b, most reads complete in 1 RTT, while some incur overhead. However, this fraction is very small (e.g., 4% in ycsb-a and 0.3% in ycsb-b; we see similar fractions for other workloads too). Moreover, the slow reads do not affect the overall p99 latency. In fact, examining the distribution of operation (both read and write) latencies shows that SKYROS reduces the overall p99 latency. This reduction arises because the tail in the overall workload includes expensive writes in Paxos, which SKYROS makes faster. As a result, SKYROS reduces overall p99 latency by 1.7× in ycsb-a and ycsb-b, as shown in Figure 11(c) and (e).
Latency Benefits. For a fixed number of clients, as in the previous experiment, SKYROS offers higher throughput than Paxos. This is because, in baseline Paxos, the leader waits for requests to be ordered in 2 RTTs. While SKYROS defers this ordering work, it does not avoid it. However, by moving the ordering wait in Paxos to the background, SKYROS is able to use the otherwise idle CPU cycles to accept more requests; this enables SKYROS to achieve higher throughput.

Paxos, with batching across many clients, can achieve high throughput levels (similar to SKYROS). However, at such high throughput, SKYROS offers significant latency benefits. To illustrate this, we measure the average latency at the maximum throughput obtained by Paxos for write-heavy (ycsb-a,f) and read-heavy (ycsb-b,d) workloads. As shown in Figure 12, SKYROS offers 1.32×–2.14× lower latencies than Paxos for the same throughput.

5.6 Replicated RocksDB: Paxos vs. SKYROS
We have also integrated RocksDB with SKYROS. We built a wrapper around RocksDB in which we implemented the upcalls. Figure 13 compares the performance under two workloads when using SKYROS and Paxos to replicate RocksDB. As before, SKYROS offers notable improvements.

5.7 Comparison to Commutative Protocols
We now compare SKYROS to commutative protocols. We compare against Curp [64], a recent protocol that improves over prior commutative protocols. Curp targets primary-backup, but sketches the protocol for consensus [64, §Appendix-B.2]. In this protocol, a client sends an update 𝑢 to all replicas; each replica adds 𝑢 to a witness component if 𝑢 commutes with prior operations in the witness. The leader adds 𝑢 to the log, executes 𝑢 speculatively, and returns a response. Clients wait for a supermajority of responses (including the leader's result). If the leader detects a conflict, it initiates a sync, finishing the operation in 2 RTTs. If a conflict arises at the followers, the client detects that and informs the leader to initiate a sync; such requests take 3 RTTs. Reads are sent only to the leader and thus incur only 2 RTTs upon conflicts. We implement this protocol and call our implementation Curp-c.
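The following sketch captures the witness-based conflict check described above; it simplifies commutativity to a same-key test and is only an approximation of Curp-c, not its actual implementation.

```python
class Witness:
    """Per-replica witness from the Curp description above (sketch)."""
    def __init__(self):
        self.unsynced = []        # updates accepted but not yet synced

    def record(self, update):
        for prior in self.unsynced:
            if prior["key"] == update["key"]:
                return False      # conflict: a sync (2 or 3 RTTs) is needed
        self.unsynced.append(update)
        return True               # commutes: eligible for the 1-RTT path

    def sync(self):
        # A leader-initiated sync orders the unsynced updates and clears
        # the witness, after which conflicting requests can complete.
        self.unsynced.clear()
```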
5.7.1 Benefits over Commutative Protocols. We first compare SKYROS and Curp-c under a write-only key-value workload (only set). Figure 14(a) shows the result. In the no-conflict case (no two writes access the same key), Curp-c and SKYROS perform similarly and are 2× faster than Paxos. In Curp-c, all requests take 1 RTT because no request conflicts with another. In SKYROS, all operations are nilext and so complete in 1 RTT. However, for a zipfian workload (𝜃 = 0.99, the default in YCSB), Curp-c's performance drops due to conflicts, while SKYROS maintains its high performance. In this case, SKYROS offers 2.7× lower p99 latency than Curp-c.

We next run ycsb-a (50% w, 50% r). As shown in Figure 14(b), Paxos reads take 1 RTT. In SKYROS, a small fraction of reads take 2 RTTs. A similar fraction of reads in Curp-c also conflict with prior writes and thus incur 2 RTTs. As shown in Figure 14(c), nilext writes in SKYROS always complete in 1 RTT. In contrast, in Curp-c, writes conflict with prior writes and thus sometimes incur 2 or 3 RTTs. As a result, SKYROS offers 34% lower p99 latency. We observe that write-write conflicts in Curp-c lead to 50% more slow-path operations than read-write conflicts in SKYROS and Curp-c. A write-write conflict can arise due to unsynced operations on any replica, whereas a read-write conflict can occur only at the leader. Further, the followers' knowledge of synced operations is behind the leader by a message delay, increasing the conflict window at the followers.
Figure 14. Comparison to Commutativity. (a) shows the write throughput in kv-store. (b) and (c) show the latencies for ycsb-a. (d) compares record-append throughput. (e) shows the kv-store throughput for a nilext + non-nilext workload.
Exploiting nil-externality offers benefit over commutativity when operations do not commute. To show this, we built a file store that supports GFS-style record appends [32]. The record-append interface is not commutative: records must be appended in the same order across replicas. However, it is nilext: it just returns a success. Figure 14(d) shows the result when four clients append records to a file. Because every operation conflicts, Curp-c's performance drops; it is lower than Paxos because some requests take 3 RTTs. SKYROS offers 2× higher throughput than Paxos and Curp-c.

5.7.2 Augmenting with Commutativity. While SKYROS offers performance advantages over Curp-c in many cases, non-nilext updates can reduce the performance of SKYROS. Curp-c can complete such operations in 1 RTT (when they do not conflict). The Figure 14(e) no-conflict case shows this: with 10% non-nilext writes, Curp-c performs better than SKYROS.

Fortunately, however, nil-externality is compatible with commutativity. We build SKYROS-COMM, a variant of SKYROS that exploits commutativity to speed up non-nilext operations. SKYROS-COMM handles nilext writes and reads in the same way as SKYROS. However, non-nilext writes are handled similarly to Curp-c. Upon a non-nilext write, a replica checks for conflicts with the pending nilext and non-nilext writes. If there are none, similar to Curp-c, the replicas add this operation to their durability logs. Since non-nilext operations expose state, the leader also executes the operation and returns the result. Clients wait for a supermajority of responses, including the execution result from the leader and acknowledgments from other replicas. Similar to SKYROS, these responses must be from the same view.

SKYROS-COMM handles non-nilext-write conflicts in 2 or 3 RTTs. A conflicting non-nilext write at the leader is treated similarly to a read that accesses a pending update, finishing the operation in 2 RTTs. If the conflict does not arise at the leader but at the followers, the client detects the conflict and resends the request to the leader. The leader then enforces order by committing the request (and prior ones) to other replicas, finishing the operation in a total of 3 RTTs. Note that SKYROS-COMM does not check for conflicts for nilext writes because they are ordered and executed only lazily.
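The handling of non-nilext writes described above can be summarized by the following replica-side sketch; the structures and return values are assumptions, not the actual implementation.

```python
def handle_non_nilext_write(replica, op):
    """Replica-side sketch of the scheme described above."""
    conflict = any(p["key"] == op["key"] for p in replica.pending_writes)
    if conflict and not replica.is_leader:
        # The client detects the follower-side conflict and resends the
        # request to the leader (3 RTTs in total).
        return {"status": "conflict"}
    if conflict and replica.is_leader:
        # Leader-side conflict: order the pending updates first, as for a
        # read that accesses a pending update (2 RTTs in total).
        replica.finalize_pending()
    replica.pending_writes.append(op)
    if replica.is_leader:
        # Non-nilext operations expose state, so the leader also executes
        # the write and returns the result along with its acknowledgment.
        return {"status": "ok", "result": replica.storage.apply(op)}
    return {"status": "ok"}
```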
The last bar in the Figure 14(e) no-conflict case shows that SKYROS-COMM matches Curp-c's performance because it commits non-nilext writes faster than SKYROS. The Figure 14(e) zipfian case shows that Curp-c's performance reduces due to conflicts. SKYROS performs similar to Curp-c because of the 10% non-nilext writes. SKYROS-COMM, however, improves performance over SKYROS and Curp-c by combining the advantages of nil-externality and commutativity.

6 Discussion
In this paper, we exploit nilext interfaces in the context of leader-based replication for key-value stores. Further, our evaluation focused on single-datacenter settings. However, the general idea of exploiting nil-externality can be applied in other contexts as well. We discuss such possible extensions.

Beyond Key-value Stores. Key-value stores (especially ones built atop write-optimized structures) have many nilext interfaces, enabling fast replication. Nil-externality can be exploited to perform fast replication for other systems such as databases and file systems as well. As an example, consider the POSIX file API. Writes in POSIX (i.e., the write system call, and variants like pwrite and O_APPEND writes) are nilext because they do not externalize state, barring catastrophic I/O errors (e.g., due to a bad disk). Writes can thus be replicated performantly. Further, some file systems have been built upon write-optimized structures [26, 39], making most file-system operations nilext by design. A nilext-aware protocol can enable fast replication for such file systems.
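As a toy illustration of such a classification, one might tag calls as below; the sets are assumptions that follow the discussion above (they are not exhaustive or authoritative), and the safe default of treating unknown calls as non-nilext follows §4.8.

```python
# Illustrative classification only.
NILEXT_CALLS = {"write", "pwrite", "append"}       # acknowledge, expose nothing
EXTERNALIZING_CALLS = {"read", "pread", "stat"}    # return system state

def is_nilext(call):
    if call in NILEXT_CALLS:
        return True
    if call in EXTERNALIZING_CALLS:
        return False
    return False   # when unsure, treat as non-nilext (the safe default)
```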
Leaderless Protocols. SKYROS is a leader-based protocol. The leader can become a performance bottleneck in such leader-based protocols. Also, clients cannot make progress when the leader fails (before a new leader is chosen). Leaderless protocols [54, 58] allow any replica to accept requests, leading to better performance and availability. The idea of exploiting nil-externality can be applied to such leaderless protocols as well. Leaderless protocols such as EPaxos [58] exploit commutativity to commit requests in one WAN RTT in geo-replicated settings. However, conflicting writes incur additional roundtrips. Such a protocol can be augmented to exploit nil-externality to avoid resolving conflicts on nilext writes and do so only on non-nilext writes or reads.

Multi-Datacenter Settings. Unlike protocols designed for the data center [50, 67], SKYROS is applicable to geo-replicated settings as well. By avoiding one WAN RTT, SKYROS can reduce latency for nilext operations significantly. However, in some scenarios, SKYROS may lead to higher latencies than a traditional 2-RTT protocol. In particular, when a majority of the replicas (but not a supermajority) are in the same region as the client, committing to a majority in two RTTs might be cheaper than committing to a supermajority in one RTT. While such a deployment is not commonly used (for fault-tolerance reasons), when it is, SKYROS could be modified to fall back to the "slow" 2-RTT protocol based on measurements (similar to recent systems [78]).

7 Related Work
Commit Before Externalize. Our idea of deferring work until externalization bears similarity to prior systems. Xsyncfs defers disk I/O until output is externalized [60], essentially moving the output commit [25, 72] to clients. SpecPaxos [67], Zyzzyva [42], and SpecBFT [77] do the same for replication. As discussed in §3.4.1, these protocols execute requests in the correct order before notifying the end application. Our approach, in contrast, defers ordering or executing nilext operations beyond notifying the end application.

State modified by nilext updates can be externalized by later non-nilext operations, upon which SKYROS enforces the required ordering and execution. Occult [56] and CAD [31] use a similar idea at a high level. Occult defers enforcing causal consistency upon writes and does so only when clients read data. Similarly, CAD does not guarantee durability when writes complete; writes are made durable only upon subsequent reads [31]. However, these systems do not offer linearizability, unlike SKYROS. Further, these systems defer work on all updates, unlike our work, which defers work based on whether or not the write is nilext. Prior work in unreplicated databases [30] realizes that some transactions only return an abort or commit and thus can be evaluated lazily, improving performance. Our work focuses on replicated storage and identifies a general interface-level property that allows deferring ordering and execution.

Exploiting Semantics. Inconsistent replication (IR) [80] realizes that inconsistent operations only require durability and thus can be completed in 1 RTT. Nilext operations, in contrast, require durability and ordering. Further, IR cannot support general state machines. Prior replication [45, 58, 64] and transaction protocols [59] use commutativity to improve performance. Nil-externality has advantages over, and combines well with, commutativity (§5.7). SKYROS's use of a DAG to resolve real-time order has a similar flavor to commutative protocols [58, 59]. However, these protocols resolve order in the common case before execution; SKYROS needs such a step only during view changes. Gemini [49] and Pileus [43] realize that some operations need only weak consistency and perform these operations faster; we focus on realizing strong consistency with high performance.

SMR Optimizations. Apart from the approaches in §3.4.1, prior systems have pushed consensus into the network [20, 21]. Domino uses a predictive approach to reduce latency in WANs [78] and allows clients to choose between Multi-Paxos and Fast-Paxos schemes. As discussed in §6, ideas from Domino can be utilized in SKYROS to fall back to a 2-RTT path in geo-replicated scenarios where a single RTT to a supermajority is more expensive than two RTTs to a majority. Prior work has also proposed other techniques to realize high performance: in multi-core servers [34, 40], by enabling quorum reads [12], and by partitioning state [47]. Such optimizations could also benefit SKYROS.

Local Storage Techniques. Techniques in SKYROS bear similarities to database write-ahead logging (WAL) [57] and file-system journaling [35]. However, our techniques differ in important aspects. While WAL and journaling do enable delaying writes to final on-disk pages, the writes are still applied to in-memory pages before responding to clients. Further, background disk writes are not triggered by externalizing operations but rather occur asynchronously; externalizing operations can proceed by accessing the in-memory state. In contrast, SKYROS defers applying updates altogether until externalization. While both WAL and the durability log in SKYROS ensure durability, WAL also imposes an order on the transactions. Group commit [23, 35] batches several updates to amortize disk-access costs; Multi-Paxos and SKYROS similarly use batching at the leader to amortize cost.

8 Conclusion
In this paper, we identify nil-externality, a storage-interface property, and show that this property is prevalent in storage systems. We design nilext-aware replication, a new approach to replication that takes advantage of nilext interfaces to improve performance by lazily ordering and executing updates. We experimentally demonstrate that nilext-aware replication improves performance over existing approaches for a range of workloads. More broadly, our work shows that exposing and exploiting properties across the layers of a storage system can bring significant performance benefits. Storage systems today layer existing replication protocols upon local storage systems (such as key-value stores). Such black-box layering masks vital information across these layers, resulting in missed performance opportunities. This paper shows that by making the replication layer aware of the underlying storage-interface properties, higher performance can be realized.

The source code of SKYROS and our experimental artifacts are available at https://bitbucket.org/aganesan4/skyros/.

Acknowledgments. We thank Bernard Wong (our shepherd) and the anonymous SOSP '21 reviewers for their insightful comments. We thank the following VMware Research Group members for their invaluable discussions: Jon Howell, Lalith Suresh, Marcos Aguilera, Mihai Budiu, Naama Ben-David, Rob Johnson, and Sujata Banerjee. Finally, the first two authors would like to extend special thanks to grandmother Jayanthy Alagappan for taking care of their toddler daughter while they were working on this paper.
References
[1] 2021. Memcached Commands. https://github.com/memcached/memcached/wiki/Commands#set.
[2] Hussam Abu-Libdeh, Robbert Van Renesse, and Ymir Vigfusson. 2013. Leveraging Sharding in the Design of Scalable Replication Protocols. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '13). Santa Clara, CA.
[3] Apache. 2021. ZooKeeper. https://zookeeper.apache.org/.
[4] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. 1995. Sharing Memory Robustly in Message-passing Systems. Journal of the ACM (JACM) 42, 1 (1995), 124–142.
[5] AWS News Blog. 2020. New EC2 M5zn Instances – Fastest Intel Xeon Scalable CPU in the Cloud. https://aws.amazon.com/blogs/aws/new-ec2-m5zn-instances-fastest-intel-xeon-scalable-cpu-in-the-cloud/.
[6] Michael A Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Bradley C Kuszmaul, Donald E Porter, Jun Yuan, and Yang Zhan. 2015. An Introduction to Be-trees and Write-optimization. USENIX ;login: 40, 5 (2015), 22–28.
[7] William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters, and Peng Li. 2011. Paxos Replicated State Machines As the Basis of a High-performance Data Store. In Proceedings of the 8th Symposium on Networked Systems Design and Implementation (NSDI '11). Boston, MA.
[8] Gerth Stølting Brodal and Rolf Fagerberg. 2003. Lower Bounds for External Memory Dictionaries. In SODA, Vol. 3.
[9] Navin Budhiraja, Keith Marzullo, Fred B Schneider, and Sam Toueg. 1993. The Primary-backup Approach. Distributed Systems 2 (1993).
[10] Matthew Burke, Audrey Cheng, and Wyatt Lloyd. 2020. Gryff: Unifying Consensus and Shared Registers. In Proceedings of the 17th Symposium on Networked Systems Design and Implementation (NSDI '20). Santa Clara, CA.
[11] Zhichao Cao, Siying Dong, Sagar Vemuri, and David H.C. Du. 2020. Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST '20). Santa Clara, CA.
[12] Aleksey Charapko, Ailidani Ailijiang, and Murat Demirbas. 2019. Linearizable Quorum Reads in Paxos. In 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '19). Renton, WA.
[13] David R Cheriton. 1987. UIO: A Uniform I/O System Interface for Distributed Systems. ACM Transactions on Computer Systems (TOCS) 5, 1 (1987).
[14] Austin T Clements, M Frans Kaashoek, Nickolai Zeldovich, Robert T Morris, and Eddie Kohler. 2013. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). Farmington, Pennsylvania.
[15] Alexander Conway, Abhishek Gupta, Vijay Chidambaram, Martin Farach-Colton, Richard Spillane, Amy Tai, and Rob Johnson. 2020. SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). Online.
[16] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '10). Indianapolis, IA.
[17] James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's Globally Distributed Database. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (OSDI '12). Hollywood, CA.
[18] James Cowling and Barbara Liskov. 2012. Granola: Low-overhead Distributed Transaction Coordination. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). Boston, MA.
[19] James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. 2006. HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault Tolerance. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06). Seattle, WA.
[20] Huynh Tu Dang, Pietro Bressana, Han Wang, Ki Suh Lee, Noa Zilberman, Hakim Weatherspoon, Marco Canini, Fernando Pedone, and Robert Soulé. 2020. P4xos: Consensus as a Network Service. IEEE/ACM Transactions on Networking 28, 4 (2020).
[21] Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, and Robert Soulé. 2015. NetPaxos: Consensus at Network Speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR '15). Santa Clara, CA.
[22] Denis Serenyi. [n. d.]. Cluster-Level Storage @ Google. http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf.
[23] David J DeWitt, Randy H Katz, Frank Olken, Leonard D Shapiro, Michael R Stonebraker, and David A. Wood. 1984. Implementation Techniques for Main Memory Database Systems. In Proceedings of the 1984 ACM SIGMOD Conference on the Management of Data (SIGMOD '84). Boston, MA.
[24] Effi Ofer, Danny Harnik, and Ronen Kat. 2021. Object Storage Traces: A Treasure Trove of Information for Optimizing Cloud Workloads. https://www.ibm.com/cloud/blog/object-storage-traces.
[25] Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B Johnson. 2002. A Survey of Rollback-recovery Protocols in Message-passing Systems. ACM Computing Surveys (CSUR) 34, 3 (2002), 375–408.
[26] John Esmet, Michael A. Bender, Martin Farach-Colton, and Bradley C. Kuszmaul. 2012. The TokuFS Streaming File System. In 4th Workshop on Hot Topics in Storage and File Systems (HotStorage '12). Boston, Massachusetts.
[27] Facebook. 2016. MyRocks: A Space- and Write-optimized MySQL Database. https://engineering.fb.com/2016/08/31/core-data/myrocks-a-space-and-write-optimized-mysql-database/.
[28] Facebook. 2021. Merge Operator. https://github.com/facebook/rocksdb/wiki/Merge-Operator.
[29] Facebook. 2021. RocksDB. http://rocksdb.org/.
[30] Jose M Faleiro, Alexander Thomson, and Daniel J Abadi. 2014. Lazy Evaluation of Transactions in Database Systems. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). Snowbird, UT.
[31] Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2020. Strong and Efficient Consistency with Consistency-aware Durability. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST '20). Santa Clara, CA.
[32] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03). Bolton Landing, New York.
[33] Sanjay Ghemawhat, Jeff Dean, Chris Mumford, David Grogan, and Victor Costan. 2011. LevelDB. https://github.com/google/leveldb.
[34] Zhenyu Guo, Chuntao Hong, Mao Yang, Dong Zhou, Lidong Zhou, and Li Zhuang. 2014. Rex: Replication at the Speed of Multi-core. In Proceedings of the EuroSys Conference (EuroSys '14). Amsterdam, The Netherlands.
[35] Robert Hagmann. 1987. Reimplementing the Cedar File System Using Logging and Group Commit. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP '87). Austin, Texas.
[36] Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: A Correctness Condition for Concurrent Objects. ACM Trans. Program. Lang. Syst. 12, 3 (July 1990).
[37] Paul Hudak. 1989. Conception, Evolution, and Application of Functional Programming Languages. ACM Computing Surveys 21, 3 (1989).
[38] IBM. 2021. Locations for Resource Deployment: Multizone Regions. https://cloud.ibm.com/docs/overview?topic=overview-locations#mzr-table.
[39] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. 2015. BetrFS: A Right-optimized Write-optimized File System. In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST '15). Santa Clara, CA.
[40] Manos Kapritsos, Yang Wang, Vivien Quema, Allen Clement, Lorenzo Alvisi, and Mike Dahlin. 2012. All About Eve: Execute-verify Replication for Multi-core Servers. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (OSDI '12). Hollywood, CA.
[41] Bettina Kemme, Fernando Pedone, Gustavo Alonso, and André Schiper. 1999. Processing Transactions over Optimistic Atomic Broadcast Protocols. In International Symposium on Distributed Computing (DISC 99). Bratislava, Slovak Republic.
[42] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. 2007. Zyzzyva: Speculative Byzantine Fault Tolerance. In ACM SIGOPS Operating Systems Review, Vol. 41. ACM, 45–58.
[43] Ramakrishna Kotla, Mahesh Balakrishnan, Marcos K. Aguilera, and Doug Terry. 2013. Consistency-based Service Level Agreements for Cloud Storage. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). Farmington, Pennsylvania.
[44] Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News 32, 4 (2001), 18–25.
[45] Leslie Lamport. 2005. Generalized Consensus and Paxos. (2005).
[46] Butler W Lampson. 1983. Hints for Computer System Design. In Proceedings of the 9th ACM Symposium on Operating System Principles (SOSP '83). Bretton Woods, New Hampshire.
[47] Long Hoang Le, Enrique Fynn, Mojtaba Eslahi-Kelorazi, Robert Soulé, and Fernando Pedone. 2019. Dynastar: Optimized Dynamic Partitioning for Scalable State Machine Replication. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS '19). Dallas, TX.
[48] Collin Lee, Seo Jin Park, Ankita Kejriwal, Satoshi Matsushita, and John Ousterhout. 2015. Implementing Linearizability at Large Scale and Low Latency. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15). Monterey, California.
[49] Cheng Li, Daniel Porto, Allen Clement, Johannes Gehrke, Nuno Preguiça, and Rodrigo Rodrigues. 2012. Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (OSDI '12). Hollywood, CA.
[50] Jialin Li, Ellis Michael, Naveen Kr Sharma, Adriana Szekeres, and Dan RK Ports. 2016. Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). Savannah, GA.
[51] Wei Lin, Mao Yang, Lintao Zhang, and Lidong Zhou. 2008. PacificA: Replication in Log-based Distributed Storage Systems. Technical Report MSR-TR-2008-25.
[52] Barbara Liskov and James Cowling. 2012. Viewstamped Replication Revisited. (2012).
[53] Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, and Michael Williams. 1991. Replication in the Harp File System. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP '91). Pacific Grove, CA.
[54] Yanhua Mao, Flavio P. Junqueira, and Keith Marzullo. 2008. Mencius: Building Efficient Replicated State Machines for WANs. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI '08). San Diego, CA.
[55] Yoshinori Matsunobu, Siying Dong, and Herman Lee. 2020. MyRocks: LSM-tree Database Storage Engine Serving Facebook's Social Graph. Proceedings of the VLDB Endowment 13, 12 (2020).
[56] Syed Akbar Mehdi, Cody Littley, Natacha Crooks, Lorenzo Alvisi, Nathan Bronson, and Wyatt Lloyd. 2017. I Can't Believe It's Not Causal! Scalable Causal Consistency with No Slowdown Cascades. In Proceedings of the 14th Symposium on Networked Systems Design and Implementation (NSDI '17). Boston, MA.
[57] Chandrasekaran Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Transactions on Database Systems (TODS) 17, 1 (1992), 94–162.
[58] Iulian Moraru, David G Andersen, and Michael Kaminsky. 2013. There is More Consensus in Egalitarian Parliaments. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). Farmington, Pennsylvania.
[59] Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li. 2016. Consolidating Concurrency Control and Consensus for Commits under Conflicts. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). Savannah, GA.
[60] Edmund B Nightingale, Kaushik Veeraraghavan, Peter M Chen, and Jason Flinn. 2006. Rethink the Sync. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06). Seattle, WA.
[61] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. 2013. Scaling Memcache at Facebook. In Proceedings of the 10th Symposium on Networked Systems Design and Implementation (NSDI '13). Lombard, IL.
[62] Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In 2014 USENIX Annual Technical Conference (USENIX ATC 14). Philadelphia, PA.
[63] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33, 4 (1996).
[64] Seo Jin Park and John Ousterhout. 2019. Exploiting Commutativity For Practical Fast Replication. In Proceedings of the 16th Symposium on Networked Systems Design and Implementation (NSDI '19). Boston, MA.
[65] Fernando Pedone and André Schiper. 2002. Handling Message Semantics with Generic Broadcast Protocols. Distributed Computing (2002).
[66] Percona. 2013. Fast Updates with TokuDB. https://www.percona.com/blog/2013/02/12/fast-updates-with-tokudb/.
[67] Dan RK Ports, Jialin Li, Vincent Liu, Naveen Kr Sharma, and Arvind Krishnamurthy. 2015. Designing Distributed Systems Using Approximate Synchrony in Data Center Networks. In Proceedings of the 12th Symposium on Networked Systems Design and Implementation (NSDI '15). Oakland, CA.
[68] Sudip Roy, Lucja Kot, and Christoph Koch. 2013. Quantum Databases. In Proceedings of the 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013). Asilomar, CA.
[69] Stephen M Rumble, Diego Ongaro, Ryan Stutsman, Mendel Rosenblum, and John K Ousterhout. 2011. It's Time for Low Latency. In The Thirteenth Workshop on Hot Topics in Operating Systems (HotOS XIII). Napa, CA.
[70] Russel Sandberg. 1986. The Sun Network File System: Design, Implementation and Experience. In Proceedings of the USENIX Summer Technical Conference (USENIX Summer '86). Atlanta, GA.
[71] Fred B. Schneider. 1990. Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial. ACM Comput. Surv. 22, 4 (December 1990), 299–319. https://doi.org/10.1145/98163.98167
[72] Rob Strom and Shaula Yemini. 1985. Optimistic Recovery in Distributed Systems. ACM Transactions on Computer Systems (TOCS) 3, 3 (1985), 204–226.
[73] Amy Tai, Andrew Kryczka, Shobhit O. Kanaujia, Kyle Jamieson, Michael J. Freedman, and Asaf Cidon. 2019. Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC 19). Renton, WA.
[74] Twitter. 2012. Caching with Twemcache. https://blog.twitter.com/engineering/en_us/a/2012/caching-with-twemcache.html.
[75] Twitter. 2020. Twitter Cache Trace. https://github.com/twitter/cache-trace.
[76] Robbert Van Renesse and Fred B Schneider. 2004. Chain Replication for Supporting High Throughput and Availability. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04). San Francisco, CA.
[77] Benjamin Wester, James A Cowling, Edmund B Nightingale, Peter M Chen, Jason Flinn, and Barbara Liskov. 2009. Tolerating Latency in Replicated State Machines Through Client Speculation. In Proceedings of the 6th Symposium on Networked Systems Design and Implementation (NSDI '09). Boston, MA.
[78] Xinan Yan, Linguan Yang, and Bernard Wong. 2020. Domino: Using Network Measurements to Reduce State Machine Replication Latency in WANs. In Proceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies.
[79] Juncheng Yang, Yao Yue, and K. V. Rashmi. 2020. A Large Scale Analysis of Hundreds of In-memory Cache Clusters at Twitter. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI '20). Banff, Canada.
[80] Irene Zhang, Naveen Kr Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan RK Ports. 2015. Building Consistent Transactions with Inconsistent Replication. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15). Monterey, California.
