DieCast: Testing Distributed Systems with an Accurate Scale Model

Diwaker Gupta, Kashi V. Vishwanath, and Amin Vahdat


University of California, San Diego
{dgupta,kvishwanath,vahdat}@cs.ucsd.edu

Abstract

Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains a difficult problem. Existing testing techniques, such as simulation or running smaller instances of a service, have limitations in predicting overall service behavior.

Although technically and economically infeasible at this time, testing should ideally be performed at the same scale and with the same configuration as the deployed service. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines (VM) spread across a much smaller number of physical machines in a test harness. CPU, network, and disk are then accurately scaled to provide the illusion that each VM matches a machine from the original service in terms of both available computing resources and communication behavior to remote service nodes. We present the architecture and evaluation of a system to support such experimentation and discuss its limitations. We show that for a variety of services—including a commercial, high-performance, cluster-based file system—and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.

1 Introduction

Today, more and more services are being delivered by complex systems consisting of large ensembles of machines spread across multiple physical networks and geographic regions. Economies of scale, incremental scalability, and good fault isolation properties have made clusters the preferred architecture for building planetary-scale services. A single logical request may touch dozens of machines on multiple networks, all providing instances of services transparently replicated across multiple machines. Services consisting of tens of thousands of machines are commonplace [11].

Economic considerations have pushed service providers to a regime where individual service machines must be made from commodity components—saving an extra $500 per node in a 100,000-node service is critical. Similarly, nodes run commodity operating systems, with only moderate levels of reliability, and custom-written applications that are often rushed to production because of the pressures of "Internet Time." In this environment, failure is common [24] and it becomes the responsibility of higher-level software architectures, usually employing custom monitoring infrastructures and significant service and data replication, to mask individual, correlated, and cascading failures from end clients.

One of the primary challenges facing designers of modern network services is testing their dynamically evolving system architecture. In addition to the sheer scale of the target systems, challenges include: heterogeneous hardware and software, dynamically changing request patterns, complex component interactions, failure conditions that only manifest under high load [21], the effects of correlated failures [20], and bottlenecks arising from complex network topologies. Before upgrading any aspect of a networked service—the load balancing/replication scheme, individual software components, the network topology—architects would ideally create an exact copy of the system, modify the single component to be upgraded, and then subject the entire system to both historical and worst-case workloads. Such testing must include subjecting the system to a variety of controlled failure and attack scenarios, since problems with a particular upgrade will often only be revealed under certain specific conditions.

Creating an exact copy of a modern networked service for testing is often technically challenging and economically infeasible. The architecture of many large-scale networked services can be characterized as "controlled chaos," where it is often impossible to know exactly what the hardware, software, and network topology of the system looks like at any given time. Even when the precise hardware, software and network configuration of the system is known, the resources to replicate the production environment might simply be unavailable, particularly for large services. And yet, reliable, low overhead, and economically feasible testing of network services remains critical to delivering robust higher-level services.

The goal of this work is to develop a testing methodology and architecture that can accurately predict the behavior of modern network services while employing an order of magnitude less hardware resources. For example, consider a service consisting of 10,000 heterogeneous machines, 100 switches, and hundreds of individual software configurations. We aim to configure a smaller number of machines (e.g., 100-1000 depending on service characteristics) to emulate the original configuration as closely as possible and to subject the test infrastructure to the same workload and failure conditions as the original service. The performance and failure response of the test system should closely approximate the real behavior of the target system. Of course, these goals are infeasible without giving something up: if it were possible to capture the complex behavior and overall performance of a 10,000 node system on 1,000 nodes, then the original system should likely run on 1,000 nodes.

A key insight behind our work is that we can trade time for system capacity while accurately scaling individual system components to match the behavior of the target infrastructure. We employ time dilation to accurately scale the capacity of individual systems by a configurable factor [19]. Time dilation fully encapsulates operating systems and applications such that the rate at which time passes can be modified by a constant factor. A time dilation factor (TDF) of 10 means that for every second of real time, all software in a dilated frame believes that time has advanced by only 100 ms. If we wish to subject a target system to a one-hour workload when scaling the system by a factor of 10, the test would take 10 hours of real time. For many testing environments, this is an appropriate tradeoff. Since the passage of time is slowed down while the rate of external events (such as network I/O) remains unchanged, the system appears to have substantially higher processing power and faster network and disk.

In this paper, we present DieCast, a complete environment for building accurate models of network services (Section 2). Critically, we run the actual operating systems and application software of some target environment on a fraction of the hardware in that environment. This work makes the following contributions. First, we extend our original implementation of time dilation [19] to support fully virtualized as well as paravirtualized hosts. To support complete system evaluations, our second contribution shows how to extend dilation to disk and CPU (Section 3). In particular, we integrate a full disk simulator into the virtual machine monitor (VMM) to consider a range of possible disk architectures. Finally, we conduct a detailed system evaluation, quantifying DieCast's accuracy for a range of services, including a commercial storage system (Sections 4 and 5). The goals of this work are ambitious, and while we cannot claim to have addressed all of the myriad challenges associated with testing large-scale network services (Section 6), we believe that DieCast shows significant promise as a testing vehicle.

2 System Architecture

We begin by providing an overview of our approach to scaling a system down to a target test harness. We then discuss the individual components of our architecture.

2.1 Overview

Figure 1 gives an overview of our approach. On the left (Figure 1(a)) is an abstract depiction of a network service. A load balancing switch sits in front of the service and redirects requests among a set of front-end HTTP servers. These requests may in turn travel to a middle tier of application servers, who may query a storage tier consisting of databases or network attached storage.

Figure 1(b) shows how a target service can be scaled with DieCast. We encapsulate all nodes from the original service in virtual machines and multiplex several of these VMs onto physical machines in the test harness. Critically, we employ time dilation in the VMM running on each physical machine to provide the illusion that each virtual machine has, for example, as much processing power, disk I/O, and network bandwidth as the corresponding host in the original configuration despite the fact that it is sharing underlying resources with other VMs. DieCast configures VMs to communicate through a network emulator to reproduce the characteristics of the original system topology. We then initialize the test system using the setup routines of the original system and subject it to appropriate workloads and fault-loads to evaluate system behavior.

The overall goal is to improve predictive power. That is, runs with DieCast on smaller machine configurations should accurately predict the performance and fault tolerance characteristics of some larger production system. In this manner, system developers may experiment with changes to system architecture, network topology, software upgrades, and new functionality before deploying them in production. Successful runs with DieCast should improve confidence that any changes to the target service will be successfully deployed. Below, we discuss the steps in applying DieCast scaling to target systems.

2.2 Choosing the Scaling Factor

The first question to address is the desired scaling factor. One use of DieCast is to reproduce the scale of an original service in a test cluster. Another application is to scale existing test harnesses to achieve more realism than possible from the raw hardware.
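To make the arithmetic of time dilation concrete, the following is a minimal illustrative sketch (not DieCast code): at a TDF of 10, a one-hour dilated workload costs ten hours of real time, and a fixed physical rate such as 100 Mbps appears to the dilated guest as 1 Gbps.

```python
# Minimal sketch of the time-dilation arithmetic described above; this is an
# illustration, not DieCast code.

def real_seconds(dilated_seconds: float, tdf: float) -> float:
    """Real time required for a dilated guest to perceive `dilated_seconds`."""
    return dilated_seconds * tdf

def perceived_rate(physical_rate: float, tdf: float) -> float:
    """Rate observed inside the dilated time frame for a fixed physical rate."""
    return physical_rate * tdf

if __name__ == "__main__":
    tdf = 10
    print(real_seconds(3600, tdf))      # a one-hour workload takes 36000 s of real time
    print(perceived_rate(100e6, tdf))   # 100 Mbps of physical bandwidth appears as 1 Gbps
```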

Figure 1: Scaling a network service to the DieCast infrastructure. (a) Original System; (b) Test System.

with a more complex communication topology. While not substantially increase the typically dominant human
the DieCast system may still fall short of the scale of cost of administering a given test infrastructure because
the original service, it can provide more meaningful ap- the number of required administrators for a given test
proximations under more intense workloads and failure harness usually grows with the number of machines in
conditions than might have otherwise been possible. the system rather than with the total memory of the sys-
Overall, the goal is to pick the largest scaling factor tem.
possible while still obtaining accurate predictions from Looking forward, ongoing research in VMM architec-
DieCast, since the prediction accuracy will naturally de- tures have the potential to reclaim some of the mem-
grade with increasing scaling factors. This maximum ory [32] and storage overhead [33] associated with multi-
scaling factor depends on the the characteristics of the plexing VMs on a single physical machine. For instance,
target system. Section 6 highlights the potential limita- four nearly identically configured Linux machines run-
tions of DieCast scaling. In general, scaling accuracy ning the same web server will overlap significantly in
will degrade with: i) application sensitivity to the fine- terms of their memory and storage footprints. Similarly,
grained timing behavior of external hardware devices; consider an Internet service that replicates content for im-
ii) capacity-constrained physical resources; and iii) sys- proved capacity and availability. When scaling the ser-
tem devices not amenable to virtualization. In the first vice down, multiple machines from the original configu-
category, application interaction with I/O devices may ration may be assigned to a single physical machine. A
depend on the exact timing of requests and responses. VMM capable of detecting and exploiting available re-
Consider for instance a fine-grained parallel application dundancy could significantly reduce the incremental stor-
that assumes all remote instances are co-scheduled. A age overhead of multiplexing multiple VMs.
DieCast run may mispredict performance if target nodes
are not scheduled at the time of a message transmission 2.3 Cataloging the Original System
to respond to a blocking read operation. If we could in-
terleave at the granularity of individual instructions, then The next task is to configure the appropriate virtual ma-
this would not be an issue. However, context switching chine images onto our test infrastructure. Maintaining a
among virtual machines means that we must pick time catalog of the hardware and software configuration that
slices on the order of milliseconds. Second, DieCast can- comprises an Internet service is challenging in its own
not scale the capacity of hardware components such as right. However, for the purposes of this work, we as-
main memory, processor caches, and disk. Finally, the sume that such a catalog is available. This catalog would
original service may contain devices such as load bal- consist of all of the hardware making up the service, the
ancing switches that are not amenable to virtualization or network topology, and the software configuration of each
dilation. Even with these caveats, we have successfully node. The software configuration includes the operating
applied scaling factors of 10 to a variety of services with system, installed packages and applications, and the ini-
near-perfect accuracy as discussed in Sections 4 and 5. tialization sequence run on each node after booting.
Of the above limitations to scaling, we consider capac- The original service software may or may not run on
ity limits for main memory and disk to be most signifi- top of virtual machines. However, given the increasing
cant. However, we do not believe this to be a fundamental benefits of employing virtual machines in data centers for
limitation. For example, one partial solution is to config- service configuration and management and the popular-
ure the test system with more memory and storage than ity of VM-based appliances that are pre-configured to run
the original system. While this will reduce some of the particular services [7], we assume that the original ser-
economic benefits of our approach, it will not erase them. vice is in fact VM-based. This assumption is not critical
For instance, doubling a machine’s memory will not typ- to our approach but it also partially addresses any base-
ically double its hardware cost. More importantly, it will line performance differential between a node running on

The original service software may or may not run on top of virtual machines. However, given the increasing benefits of employing virtual machines in data centers for service configuration and management and the popularity of VM-based appliances that are pre-configured to run particular services [7], we assume that the original service is in fact VM-based. This assumption is not critical to our approach, but it also partially addresses any baseline performance differential between a node running on bare hardware in the original service and the same node running on a virtual machine in the test system.

2.4 Configuring the Virtual Machines

With an understanding of appropriate scaling factors and a catalog of the original service configuration, DieCast then configures individual physical machines in the test system with multiple VM images reflecting, ideally, a one-to-one map between physical machines in the original system and virtual machines in the test system. With a scaling factor of 10, each physical node in the target system would host 10 virtual machines. The mapping from physical machines to virtual machines should account for: similarity in software configurations, per-VM memory and disk requirements, and the capacity of the hardware in the original and test system. In general, a solver may be employed to determine a near-optimal matching [26]. However, given the VM migration capabilities of modern VMMs and DieCast's controlled network emulation environment, the actual location of a VM is not as significant as in the original system.

DieCast then configures the VMs such that each VM appears to have resources identical to a physical machine in the original system. Consider a physical machine hosting 10 VMs. DieCast would run each VM with a scaling factor of 10, but allocate each VM only 10% of the actual physical resource. DieCast employs a non-work conserving scheduler to ensure that each virtual machine receives no more than its allotted share of resources even when spare capacity is available. Suppose a CPU intensive task takes 100 seconds to finish on the original machine. The same task would now take 1000 seconds (of real time) on a dilated VM, since it can only use a tenth of the CPU. However, since the VM is running under time dilation, it only perceives that 100 seconds have passed. Thus, in the VM's time frame, resources appear equivalent to the original machine. We only explicitly scale CPU and disk I/O latency on the host; scaling of network I/O happens via network emulation as described next.

2.5 Network Emulation

The final step in the configuration process is to match the network configuration of the original service using network emulation. We configure all VMs in the test system to route all their communication through our emulation environment. Note that DieCast is not tied to any particular emulation technology: we have successfully used DieCast with Dummynet [27], Modelnet [31] and Netem [3] where appropriate.

It is likely that the bisection bandwidth of the original service topology will be larger than that available in the test system. Fortunately, time dilation is of significant value here. Convincing a virtual machine scaled by a factor of 10 that it is receiving data at 1 Gbps only requires forwarding data to it at 100 Mbps. Similarly, it may appear that latencies in an original cluster-based service may be low enough that the additional software forwarding overhead associated with the emulation environment could make it difficult to match the latencies in the original network. To our advantage, maintaining accurate latency with time dilation actually requires increasing the real time delay of a given packet; e.g., a 100 µs delay network link in the original network should be delayed by 1 ms when dilating by a factor of 10.

Note that the scaling factor need not match the TDF. For example, if the original network topology is so large/fast that even with a TDF of 10 the network emulator is unable to keep up, it is possible to employ a time dilation factor of 20 while maintaining a scaling factor of 10. In such a scenario, there would still on average be 10 virtual machines multiplexed onto each physical machine; however, the VMM scheduler would allocate only 5% of the physical machine's resources to individual machines (meaning that 50% of CPU resources will go idle). The TDF of 20, however, would deliver additional capacity to the network emulation infrastructure to match the characteristics of the original system.
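The bandwidth and latency conversions described in this subsection can be summarized in a small helper; this is an illustrative sketch under the assumption that the emulator exposes per-link bandwidth and one-way latency settings (the actual values would be applied through ModelNet, Dummynet, or Netem).

```python
# Sketch of the link-parameter conversion implied by Section 2.5: to preserve
# the perceived link characteristics under time dilation, the emulator carries
# 1/TDF of the original bandwidth and TDF times the original latency.

def emulated_link(bandwidth_bps: float, one_way_latency_s: float, tdf: float):
    """Return (real bandwidth, real one-way latency) to configure in the emulator."""
    return bandwidth_bps / tdf, one_way_latency_s * tdf

if __name__ == "__main__":
    bw, lat = emulated_link(1e9, 100e-6, tdf=10)
    # 1 Gbps / 100 us in the original network -> 100 Mbps / 1 ms in the emulator
    print(bw, lat)
```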

2.6 Workload Generation

Once DieCast has prepared the test system to be resource equivalent to the original system, we can subject it to an appropriate workload. These workloads will in general be application-specific. For instance, Monkey [15] shows how to replay a measured TCP request stream sent to a large-scale network service. For this work, we use application-specific workload generators where available and in other cases write our own workload generators that both capture normal behavior as well as stress the service under extreme conditions.

To maintain a target scaling factor, clients should also ideally run in DieCast-scaled virtual machines. This approach has the added benefit of allowing us to subject a test service to a high level of perceived load using relatively few resources. Thus, DieCast scales not only the capacity of the test harness but also the workload generation infrastructure.

3 Implementation

We have implemented DieCast support on several versions of Xen [10]: v2.0.7, v3.0.4, and v3.1 (both paravirtualized and fully virtualized VMs). Here we focus on the Xen 3.1 implementation. We begin with a brief overview of time dilation [19] and then describe the new features required to support DieCast.

3.1 Time Dilation

Critical to time dilation is a VMM's ability to modify the perception of time within a guest OS. Fortunately, most VMMs already have this functionality, for example, because a guest OS may develop a backlog of "lost ticks" if it is not scheduled on the physical processor when it is due to receive a timer interrupt. Since the guest OS running in a VM does not run continuously, VMMs periodically synchronize the guest OS time with the physical machine's clock. The only requirement for a VMM to support time dilation is this ability to modify the VM's perception of time. In fact, as we demonstrate in Section 5, the concept of time dilation can be ported to other (non-virtualized) environments.

Operating systems employ a variety of time sources to keep track of time, including timer interrupts (e.g., the Programmable Interrupt Timer or PIT), specialized counters (e.g., the TSC on Intel platforms) and external time sources such as NTP. Time dilation works by intercepting the various time sources and scaling them appropriately to fully encapsulate the OS in its own time frame.

Our original modifications to Xen for paravirtualized hosts [19] therefore appropriately scale time values exposed to the VM by the hypervisor. Xen exposes two notions of time to VMs. Real time is the number of nanoseconds since boot, and wall clock time is the traditional Unix time since epoch. While Xen allows the guest OS to maintain and update its own notion of time via an external time source (such as NTP), the guest OS often relies solely on Xen to maintain accurate time. Real and wall clock time pass between the Xen hypervisor and the guest operating system via a shared data structure. Dilation uses a per-domain TDF variable to appropriately scale real time and wall clock time. It also scales the frequency of timer interrupts delivered to a guest OS, since these timer interrupts often drive the internal time keeping of a guest. Given these modifications to Xen, our earlier work showed that network dilation matches undilated baselines for complex per-flow TCP behavior in a variety of scenarios [19].

3.2 Support for OS Diversity

Our original time dilation implementation only worked with paravirtualized machines, with two major drawbacks: it supported only Linux as the guest OS, and the guest kernel required modifications. Generalizing to other platforms would have required code modifications to the respective OS. To be widely applicable, DieCast must support a variety of operating systems. To address these limitations, we ported time dilation to support fully virtualized (FV) VMs, enabling DieCast to support unmodified OS images. Note that FV VMs require platforms with hardware support for virtualization, such as Intel VT or AMD SVM. While Xen support for fully virtualized VMs differs significantly from the paravirtualized VM support in several key areas such as I/O emulation, access to hardware registers, and time management, the general idea behind the implementation remains the same: we want to intercept all sources of time and scale them.

In particular, our implementation scales the PIT, the TSC register (on x86), the RTC (Real Time Clock), the ACPI power management timer and the High Performance Event Timer (HPET). As in the original implementation, we also scale the number of timer interrupts delivered to a fully virtualized guest. We allow each VM to run with an independent scaling factor. Note, however, that the scaling factor is fixed for the lifetime of a VM—it cannot be changed at run time.

3.3 Scaling Disk I/O and CPU

Time dilation as described in [19] did not scale disk performance, making it unsuitable for services that perform significant disk I/O. Ideally, we would scale individual disk requests at the disk controller layer. The complexity of modern drive architectures, particularly the fact that much low level functionality is implemented in firmware, makes such implementations challenging. Note that simply delaying requests in the device driver is not sufficient, since disk controllers may re-order and batch requests for efficiency. On the other hand, functionality embedded in hardware or firmware is difficult to instrument and modify. Further complicating matters are the different I/O models in Xen: one for paravirtualized (PV) VMs and one for fully virtualized (FV) VMs. DieCast provides mechanisms to scale disk I/O for both models.

For FV VMs, DieCast integrates a highly accurate and efficient disk system simulator — Disksim [17] — which gives us a good trade-off between realism and accuracy. Figure 2(a) depicts our integration of Disksim into the fully virtualized I/O model: for each VM, a dedicated user space process (ioemu) in Domain-0 performs I/O emulation by exposing a "virtual disk" to the VM (the guest OS is unaware that a real disk is not present). A special file in Domain-0 serves as the backend storage for the VM's disk. To allow ioemu to interact with Disksim, we wrote a wrapper around the simulator for inter-process communication.

After servicing each request (but before returning), ioemu forwards the request to Disksim, which then returns the time, rt, the request would have taken in its simulated disk. Since we are effectively layering a software disk on top of ioemu, each request should ideally take exactly time rt in the VM's time frame, or tdf ∗ rt in real time. If delay is the amount by which this request is delayed, the total time spent in ioemu becomes delay + dt + st, where st is the time taken to actually serve the request (Disksim only simulates I/O characteristics, it does not deal with the actual disk content) and dt is the time taken to invoke Disksim itself. The required delay is then (tdf ∗ rt) − dt − st.
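The per-request delay computation for the Disksim-based FV path can be written directly from the expression above; the sketch below is illustrative (the clamp to zero for the case where overheads already exceed the target is our addition, not something stated in the text).

```python
# Sketch of the FV-VM disk delay from Section 3.3: the request should appear to
# take rt in the VM's time frame, i.e. tdf * rt in real time, so the injected
# delay subtracts the time already spent invoking Disksim (dt) and actually
# serving the request (st).

def io_delay(rt: float, dt: float, st: float, tdf: float) -> float:
    """Additional real-time delay (seconds) to inject before completing the request."""
    return max(0.0, tdf * rt - dt - st)   # clamped at zero if overheads already exceed the target

if __name__ == "__main__":
    # e.g., a request Disksim says should take 5 ms, with 0.2 ms of Disksim
    # overhead and 1 ms of actual service time, at TDF 10:
    print(io_delay(rt=0.005, dt=0.0002, st=0.001, tdf=10))   # -> 0.0488 s of injected delay
```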

Figure 2: Scaling Disk I/O. (a) I/O Model for FV VMs; (b) I/O Model for PV VMs; (c) DBench throughput (kB/s) as a function of the time dilation factor under Disksim, with curves for CPU and disk unscaled, CPU scaled (old), disk and CPU scaled, and disk and CPU scaled (improved).

The architecture of Disksim, however, is not amenable to integration with the PV I/O model (Figure 2(b)). In this "split I/O" model, a front-end driver in the VM (blkfront) forwards requests to a back-end driver in Domain-0 (blkback), which are then serviced by the real disk device driver. Thus PV I/O is largely a kernel activity, while Disksim runs entirely in user-space. Further, a separate Disksim process would be required for each simulated disk, whereas there is a single back-end driver for all VMs.

For these reasons, for PV VMs, we inject the appropriate delays in the blkfront driver. This approach has the additional advantage of containing the side effects of such delays to individual VMs — blkback can continue processing other requests as usual. Further, it eliminates the need to modify disk-specific drivers in Domain-0. We emphasize that this is functionally equivalent to per-request scaling in Disksim: the key difference is that scaling in Disksim is much closer to the (simulated) hardware. Overall, our implementation of disk scaling for PV VMs is simpler though less accurate and somewhat less flexible, since it requires the disk subsystem in the testing hardware to match the configuration in the target system.

We have validated both our implementations using several micro-benchmarks. For brevity, we only describe one of them here. We run DBench [29] — a popular hard-drive and file-system benchmark — under different dilation factors and plot the reported throughput. Figure 2(c) shows the results for the FV I/O model with Disksim integration (results for the PV implementation can be found in a separate technical report [18]). Ideally, the throughput should remain constant as a function of the dilation factor. We first run the benchmark without scaling disk I/O or CPU, and we can see that the reported throughput increases almost linearly, an undesirable behavior. Next, we repeat the experiment and scale the CPU alone (thus, at TDF 10 the VM only receives 10% of the CPU). While the increase is no longer linear, in the absence of disk dilation it is still significantly higher than the expected value. Finally, with disk dilation in place we can see that the throughput closely tracks the expected value.

However, as the TDF increases, we start to see some divergence. After further investigation, we found that this deviation results from the way we scaled the CPU. Recall that we scale the CPU by bounding the amount of CPU available to each VM. Initially, we simply used Xen's Credit scheduler to allocate an appropriate fraction of CPU resources to each VM in non-work conserving mode. However, simply scaling the CPU does not govern how those CPU cycles are distributed across time. With the original Credit scheduler, if a VM does not consume its full timeslice, it can be scheduled again in subsequent timeslices. For instance, if a VM is set to be dilated by a factor of 10 and if it consumes less than 10% of the CPU in each time slice, then it will run in every time slice, since in aggregate it never consumes more than its hard bound of 10% of the CPU. This potential to run continuously distorts the performance of I/O-bound applications under dilation; in particular, they will have a different timing distribution than they would in the real time frame. This distortion increases with increasing TDF. Thus, we found that, for some workloads, we may actually wish to ensure that the VM's CPU consumption is spread more uniformly across time.

We modified the Credit CPU scheduler in Xen to support this mode of operation as follows: if a VM runs for the entire duration of its time slice, we ensure that it does not get scheduled for the next (tdf − 1) time slices. If a VM voluntarily yields the CPU or is pre-empted before its time slice expires, it may be re-scheduled in a subsequent time slice. However, as soon as it consumes a cumulative total of a time slice's worth of run time (carried over from the previous time it was descheduled), it will be pre-empted and not allowed to run for another (tdf − 1) time slices. The final line in Figure 2(c) shows the results of the DBench benchmark using this modified scheduler. As we can see, the throughput remains consistent even at higher TDFs. Note that, unlike in this benchmark, DieCast typically runs multiple VMs per machine, in which case this "spreading" of CPU cycles occurs naturally as VMs compete for CPU.

4 Evaluation

We seek to answer the following questions with respect to DieCast-scaling: i) Can we configure a smaller number of physical machines to match the CPU capacity, complex network topology, and I/O rates of a larger service? ii) How well does the performance of a scaled service running on fewer resources match the performance of a baseline service running with more resources? We consider three different systems: i) BitTorrent, a popular peer-to-peer file sharing program; ii) RUBiS, an auction service prototyped after eBay; and iii) Isaac, our configurable network three-tier service that allows us to generate a range of workload scenarios.

4.1 Methodology

To evaluate DieCast for a given system, we first establish the baseline performance: this involves determining the configuration(s) of interest, fixing the workload, and benchmarking the performance. We then scale the system down by an order of magnitude and compare the DieCast performance to the baseline. While we have extensively evaluated DieCast implementations for several versions of Xen, we only present the results for the Xen 3.1 implementation here. A detailed evaluation for Xen 3.0.4 can be found in our technical report [18].

Each physical machine in our testbed is a dual-core 2.3GHz Intel Xeon with 4GB RAM. Note that since the Disksim integration only works with fully virtualized VMs, for a fair evaluation it is required that even the baseline system run on VMs—ideally the baseline would be run on physical machines directly (for the paravirtualized setup, we do have an evaluation with physical machines as the baseline; we refer the reader to [18] for details). We configure Disksim to emulate a Seagate ST3217 disk drive. For the baseline, Disksim runs as usual (no requests are scaled) and with DieCast, we scale each request as described in Section 3.3.

We configure each virtual machine with 256MB RAM and run Debian Etch on Linux 2.6.17. Unless otherwise stated, the baseline configuration consists of 40 physical machines hosting a single VM each. We then compare the performance characteristics to runs with DieCast on four physical machines hosting 10 VMs each, scaled by a factor of 10. We use Modelnet for the network emulation, and appropriately scale the link characteristics for DieCast. For allocating CPU, we use our modified Credit CPU scheduler as described in Section 3.3.

4.2 BitTorrent

We begin by using DieCast to evaluate BitTorrent [1] — a popular P2P application. For our baseline experiments, we run BitTorrent (version 3.4.2) on a total of 40 virtual machines. We configure the machines to communicate across a ModelNet-emulated dumbbell topology (Figure 3), with varying bandwidth and latency values for the access link (A) from each client to the dumbbell and the dumbbell link itself (C). We vary the total number of clients, the file size, the network topology, and the version of the BitTorrent software. We use the distribution of file download times across all clients as the metric for comparing performance. The aim here is to observe how closely DieCast-scaled experiments reproduce behavior of the baseline case for a variety of scenarios.

The first experiment establishes the baseline, where we compare different configurations of BitTorrent sharing a file across a 10Mbps dumbbell link and constrained access links of 10Mbps. All links have a one-way latency of 5ms. We run a total of 40 clients (with half on each side of the dumbbell). Figure 5 plots the cumulative distribution of transfer times across all clients for different file sizes (10MB and 50MB). We show the baseline case using solid lines and use dashed lines to represent the DieCast-scaled case. With DieCast scaling, the distribution of download times closely matches the behavior of the original system. For instance, well-connected clients on the same side of the dumbbell as the randomly chosen seeder finish more quickly than the clients that must compete for scarce resources across the dumbbell.

Having established a reasonable baseline, we next consider sensitivity to changing system configurations. We first vary the network topology by leaving the dumbbell link unconstrained (1 Gbps), with results in Figure 5. The graph shows the effect of removing the bottleneck on the finish times compared to the constrained dumbbell-link case for the 50-MB file: all clients finish within a small time difference of each other, as shown by the middle pair of curves.

Next, we consider the effect of varying the total number of clients. Using the topology from the baseline experiment, we repeat the experiments for 80 and 200 simultaneous BitTorrent clients. Figure 6 shows the results. The curves for the baseline and DieCast-scaled versions almost completely overlap each other for 80 clients (left pair of curves) and show minor deviation from each other for 200 clients (right pair of curves). Note that with 200 clients, the bandwidth contention increases to the point where the dumbbell bottleneck becomes less important.
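Since the metric for the BitTorrent experiments is the distribution of per-client download times, comparing a DieCast run against the baseline amounts to overlaying two empirical CDFs; the sketch below (illustrative analysis code with made-up sample values, not part of DieCast) computes the largest vertical gap between the two curves.

```python
# Sketch: compare per-client download-time distributions from a baseline run
# and a DieCast-scaled run by the maximum gap between their empirical CDFs.

def ecdf(samples):
    xs = sorted(samples)
    n = len(xs)
    return lambda t: sum(1 for x in xs if x <= t) / n

def max_cdf_gap(baseline, diecast):
    f, g = ecdf(baseline), ecdf(diecast)
    points = sorted(set(baseline) | set(diecast))
    return max(abs(f(t) - g(t)) for t in points)

if __name__ == "__main__":
    baseline_times = [62.0, 75.5, 81.2, 90.0, 104.3]   # hypothetical per-client seconds
    diecast_times = [60.5, 77.0, 83.0, 88.4, 106.1]
    print(f"max CDF gap: {max_cdf_gap(baseline_times, diecast_times):.2f}")
```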

Figure 3: Topology for BitTorrent experiments.

Figure 4: RUBiS Setup.

Figure 5: Performance with varying file sizes. Figure 6: Varying #clients. Figure 7: Different configurations. (Figures 5-7 plot the cumulative fraction of clients against the time to complete since the start of the experiment, in seconds; Figure 7 includes Baseline, DieCast TDF 10, DieCast TDF 5, and No DieCast curves.)

Finally, we consider an experiment that demonstrates the flexibility of DieCast to reproduce system performance under a variety of resource configurations starting with the same baseline. Figure 7 shows that in addition to matching 1:10 scaling using 4 physical machines hosting 10 VMs each, we can also match an alternate configuration of 8 physical machines, hosting five VMs each with a dilation factor of five. This demonstrates that even if it is necessary to vary the number of physical machines available for testing, it may still be possible to find an appropriate scaling factor to match performance characteristics. This graph also has a fourth curve, labeled "No DieCast", corresponding to running the experiment with 40 VMs on four physical machines, each with a dilation factor of 1—disk and network are not scaled (thus matching the baseline configuration), and all VMs are allocated equal shares of the CPU. This corresponds to the approach of simply multiplexing a number of virtual machines on physical machines without using DieCast. The graph shows that the behavior of the system under such a naive approach varies widely from actual behavior.

4.3 RUBiS

Next, we investigate DieCast's ability to scale a fully functional Internet service. We use RUBiS [6]—an auction site prototype designed to evaluate scalability and application server performance. RUBiS has been used by other researchers to approximate realistic Internet services [12-14].

We use the PHP implementation of RUBiS running Apache as the web server and MySQL as the database. For consistent results, we re-create the database and pre-populate it with 100,000 users and items before each experiment. We use the default read-write transaction table for the workload, which exercises all aspects of the system such as adding new items, placing bids, adding comments, and viewing and browsing the database. The RUBiS workload generators warm up for 60 seconds, followed by a session run time of 600 seconds and a ramp down of 60 seconds.

We emulate a topology of 40 nodes consisting of 8 database servers, 16 web servers and 16 workload generators as shown in Figure 4. A 100 Mbps network link connects two replicas of the service spread across the wide-area at two sites. Within a site, 1 Gbps links connect all components. For reliability, half of the web servers at each site use the database servers in the other site. There is one load generator per web server and all load generators share a 100 Mbps access link. Each system component (servers, workload generators) runs in its own Xen VM.

We now evaluate DieCast's ability to predict the behavior of this RUBiS configuration using fewer resources. Figures 8(a) and 8(b) compare the baseline performance with the scaled system for overall system throughput and average response time (across all client-webserver combinations) on the y-axis as a function of the number of simultaneous clients (offered load) on the x-axis. In both cases, the performance of the scaled service closely tracks that of the baseline. We also show the performance for the "No DieCast" configuration: regular VM multiplexing with no DieCast-scaling. Without DieCast to offset the resource contention, the aggregate throughput drops with a substantial increase in response times.
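The configuration flexibility illustrated by Figure 7 earlier in this section (4 machines with 10 VMs each at TDF 10, or 8 machines with 5 VMs each at TDF 5, for the same 40-node target) can be expressed as a small hypothetical helper; the constraint that the TDF be at least the per-machine multiplexing factor is our inference from the resource-allocation discussion in Section 2.5, not a rule stated in the text.

```python
# Hypothetical helper for enumerating DieCast test configurations for a fixed
# target size; illustrative only.

def scaled_configuration(target_nodes: int, physical_machines: int, tdf: int = None):
    if target_nodes % physical_machines:
        raise ValueError("target nodes must divide evenly across physical machines")
    vms_per_machine = target_nodes // physical_machines
    tdf = vms_per_machine if tdf is None else tdf
    if tdf < vms_per_machine:
        raise ValueError("TDF must be at least the number of VMs per machine")
    cpu_share = 1.0 / tdf        # non-work-conserving share given to each VM
    return vms_per_machine, tdf, cpu_share

if __name__ == "__main__":
    print(scaled_configuration(40, 4))          # -> (10, 10, 0.1)
    print(scaled_configuration(40, 8, tdf=5))   # -> (5, 5, 0.2)
```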

Figure 8: Comparing RUBiS application performance: Baseline vs. DieCast. (a) Throughput (aggregate throughput in requests/min versus total system load in user sessions); (b) Response Time (response time in ms versus total system load in user sessions). Both panels show Baseline, DieCast, and No DieCast curves.

Figure 9: Comparing resource utilization for RUBiS: DieCast can accurately emulate the baseline system behavior. (a) CPU profile (CPU used (%) over time for the client, web server, and DB server); (b) Memory profile (memory utilization (%) over time); (c) Network profile (data transferred (MB) per hop ID).

Figure 10: Architecture of Isaac.

Interestingly, for one of our initial tests, we ran with an unintended mis-configuration of the RUBiS database: the workload had commenting-related operations enabled, but the relevant tables were missing from the database. This led to an approximately 25% error rate, with similar timings in the responses to clients in both the baseline and DieCast configurations. These types of configuration errors are one example of the types of testing that we wish to enable with DieCast.

Next, Figures 9(a) and 9(b) compare CPU and memory utilizations for both the scaled and unscaled experiments as a function of time for the case of 4800 simultaneous user sessions: we pick one node of each type (DB server, web server, load generator) at random from the baseline, and use the same three nodes for comparison with DieCast. One important question is whether the average performance results in earlier figures hide significant incongruities in per-request performance. Here, we see that resource utilization in the DieCast-scaled experiments closely tracks the utilization in the baseline on a per-node and per-tier (client, web server, database) basis. Similarly, Figure 9(c) compares the network utilization of individual links in the topology for the baseline and DieCast-scaled experiment. We sort the links by the amount of data transferred per link in the baseline case. This graph demonstrates that DieCast closely tracks and reproduces variability in network utilization for various hops in the topology. For instance, hops 86 and 87 in the figure correspond to access links of clients and show the maximum utilization, whereas individual access links of webservers are moderately loaded.

4.4 Exploring DieCast Accuracy

While we were encouraged by DieCast's ability to scale RUBiS and BitTorrent, they represent only a few points in the large space of possible network service configurations, for instance, in terms of the ratios of computation to network communication to disk I/O. Hence, we built Isaac, a configurable multi-tier network service to stress the DieCast methodology on a range of possible configurations. Figure 10 shows Isaac's architecture. Requests originating from a client (C) travel to a unique front-end server (FS) via a load balancer (LB).

Figure 11: Request completion time. Figure 12: Tier-breakdown. Figure 13: Stressing DB/CPU. (Figures 11 and 13 plot the fraction of requests completed against the time since the start of the experiment; Figure 12 shows the percentage of time spent in each tier (DB, AS, FS); Figure 13 includes the 100-KB-write and CPU-stress workloads with and without DieCast.)

The FS makes a number of calls to other services through application servers (AS). These application servers in turn may issue read and write calls to a database back end (DB) before building a response and transmitting it back to the front-end server, which finally responds to the client. Isaac is written in Python and allows configuring the service to a given interconnect topology, computation, communication, and I/O pattern. A configuration describes, on a per request class basis, the computation, communication, and I/O characteristics across multiple service tiers. In this manner, we can configure experiments to stress different aspects of a service and to independently push the system to capacity along multiple dimensions. We use MySQL for the database tier to reflect a realistic transactional storage tier.

For our first experiment, we configure Isaac with four DBs, four ASs, four FSs and 28 clients. The clients generate requests, wait for responses, and sleep for some time before generating new requests. Each client generates 20 requests and each such request touches five ASs (randomly selected at run time) after going through the FS. Each request from the AS involves 10 reads from and 2 writes to a database, each of size 1KB. The database server is also chosen randomly at runtime. Upon completing its database queries, each AS computes 500 SHA-1 hashes of the response before sending it back to the FS. Each FS then collects responses from all five ASs and finally computes 5,000 SHA-1 hashes on the concatenated results before replying to the client. In later experiments, we vary both the amount of computation and I/O to quantify sensitivity to varying resource bottlenecks.

We perform this 40-node experiment both with and without DieCast. For brevity, we do not show the results of initial tests validating DieCast accuracy (in all cases, performance matched closely in both the dilated and baseline case). Rather, we run a more complex experiment where a subset of the machines fail and then recover. Our goal is to show that DieCast can accurately match application performance before the failure occurs, during the failure scenario, and during the application's recovery. After 200 seconds, we fail half of the database servers (chosen at random) by stopping the MySQL servers on the corresponding nodes. As a result, client requests accessing failed databases will not complete, slowing the rate of completed requests. After one minute of downtime, we restart the MySQL server, and soon after we expect the request completion rate to regain its original value. Figure 11 shows the fraction of requests completed on the y-axis as a function of time since the start of the experiment on the x-axis. DieCast closely matches the baseline application behavior with a dilation factor of 10. We also compare the percentage of time spent in each of the three tiers of Isaac averaged across all requests. Figure 12 shows that in addition to the end-to-end response time, DieCast closely tracks the system behavior on a per-tier basis.

Encouraged by the results of the previous experiment, we next attempt to saturate individual components of Isaac to explore the limits of DieCast's accuracy. First, we evaluate DieCast's ability to scale network services when database access dominates per-request service time. Figure 13 shows the completion time for requests, where each service issues a 100-KB (rather than 1-KB) write to the database with all other parameters remaining the same. This amounts to a total of 1 MB of database writes for every request from a client. Even for these larger data volumes, DieCast faithfully reproduces system performance. While for this workload we are able to maintain good accuracy, the evaluation of disk dilation summarized in Figure 2(c) suggests that there will certainly be points where disk dilation inaccuracy will affect overall DieCast accuracy.

Next, we evaluate DieCast accuracy when one of the components in our architecture saturates the CPU. Specifically, we configure our front-end servers such that prior to sending each response to the client, they compute SHA-1 hashes of the response 500,000 times to artificially saturate the CPU of this tier. The results of this experiment too are shown in Figure 13. We are encouraged overall, as the system does not significantly diverge even to the point of CPU saturation.
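Isaac itself is written in Python; the sketch below is not its implementation, but it illustrates the per-request work pattern just described (five AS calls per request, 10 reads and 2 writes of 1 KB per AS against a randomly chosen database, 500 SHA-1 hashes per AS response and 5,000 at the FS), with the database calls stubbed out.

```python
# Illustrative sketch of Isaac's per-request work pattern (not the actual code).
import hashlib
import random

NUM_DB = 4
READS, WRITES, PAYLOAD = 10, 2, 1024          # 10 reads and 2 writes of 1 KB per AS

def db_read(db_id: int) -> bytes:
    return b"x" * PAYLOAD                     # stub for a 1-KB database read

def db_write(db_id: int, data: bytes) -> None:
    pass                                      # stub for a database write

def app_server() -> bytes:
    db = random.randrange(NUM_DB)             # database chosen randomly at run time
    chunks = [db_read(db) for _ in range(READS)]
    for _ in range(WRITES):
        db_write(db, b"y" * PAYLOAD)
    response = b"".join(chunks)
    for _ in range(500):                      # 500 SHA-1 hashes at the AS tier
        hashlib.sha1(response).digest()
    return response

def front_end_server() -> bytes:
    replies = [app_server() for _ in range(5)]   # five AS calls per request
    result = b"".join(replies)
    for _ in range(5000):                     # 5,000 SHA-1 hashes at the FS tier
        hashlib.sha1(result).digest()
    return result

if __name__ == "__main__":
    print(len(front_end_server()))            # size of the concatenated response
```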

For instance, the CPU utilization for nodes hosting the FS in this experiment varied from 50-80% for the duration of the experiment, and even under such conditions DieCast closely matched the baseline system performance. The "No DieCast" lines plot the performance of the stress-DB and stress-CPU configurations with regular VM multiplexing without DieCast-scaling. As with BitTorrent and RUBiS, we see that without DieCast, the test infrastructure fails to predict the performance of the baseline system.

Figure 14: Validating DieCast on PanFS (read and write throughput in MB/s for the IOZone and MPI-IO benchmarks as a function of block size, 64K-16M, for the baseline and DieCast-scaled configurations).

5 Commercial System Evaluation

While we were encouraged by DieCast's accuracy for the applications we considered in Section 4, all of the experiments were designed by the DieCast authors and were largely academic in nature. To understand the generality of our system, we consider its applicability to a large-scale commercial system.

Panasas [4] builds scalable storage systems targeting Linux cluster computing environments. It has supplied solutions to several government agencies, oil and gas companies, media companies and several commercial HPC enterprises. A core component of Panasas's products is the PanFS parallel filesystem (henceforth referred to as PanFS): an object-based cluster filesystem that presents a single, cache coherent unified namespace to clients.

To meet customer requirements, Panasas must ensure its systems can deliver appropriate performance under a range of client access patterns. Unfortunately, it is often impossible to create a test environment that reflects the setup at a customer site. Since Panasas has several customers with very large super-computing clusters and limited test infrastructure at its disposal, its ability to perform testing at scale is severely restricted by hardware availability; exactly the type of situation DieCast targets. For example, the Los Alamos National Lab has deployed PanFS with its Roadrunner peta-scale super computer [5]. The Roadrunner system is designed to deliver a sustained performance level of one petaflop at an estimated cost of $90 million. Because of the tremendous scale and cost, Panasas cannot replicate this computing environment for testing purposes.

Porting Time Dilation. In evaluating our ability to apply DieCast to PanFS, we encountered one primary limitation. PanFS clients use a Linux kernel module to communicate with the PanFS server. The client-side code runs on recent versions of Xen, and hence, DieCast supported them with no modifications. However, the PanFS server runs in a custom operating system derived from an older version of FreeBSD that does not support Xen. The significant modifications to the base FreeBSD operating system made it impossible to port PanFS to a more recent version of FreeBSD that does support Xen. Ideally, it would be possible to simply encapsulate the PanFS server in a fully virtualized Xen VM. However, recall that this requires virtualization support in the processor, which was unavailable in the hardware Panasas was using. Even if we had the hardware, Xen did not support FreeBSD on FV VMs until recently due to a well known bug [2]. Thus, unfortunately, we could not easily employ the existing time dilation techniques with PanFS on the server side. However, since we believe DieCast concepts are general and not restricted to Xen, we took this opportunity to explore whether we could modify the PanFS OS to support DieCast, without any virtualization support.

To implement time dilation in the PanFS kernel, we scale the various time sources, and consequently, the wall clock. The TDF can be specified at boot time as a kernel parameter. As before, we need to scale down the resources available to PanFS such that its perceived capacity matches the baseline.

For scaling the network, we use Dummynet [27], which ships as part of the PanFS OS. However, there was no mechanism for limiting the CPU available to the OS, or to slow the disk. The PanFS OS does not support non-work-conserving CPU allocation. Further, simply modifying the CPU scheduler for user processes is insufficient because it would not throttle the rate of kernel processing. For CPU dilation, we had to modify the kernel as follows. We created a CPU-bound task, idle, in the kernel and statically assigned it the highest scheduling priority. We scale the CPU by maintaining the required ratio between the run times of the idle task and all remaining tasks. If the idle task consumes sufficient CPU, it is removed from the run queue and the regular CPU scheduler kicks in. If not, the scheduler always picks the idle task because of its priority.

For disk dilation, we were faced with the complication that multiple hardware and software components interact in PanFS to service clients. For performance, there are several parallel data paths and many operations are either asynchronous or cached. Accurately implementing disk dilation would require accounting for all of the possible code paths as well as modeling the disk drives with high fidelity.
dilation would be perceived as the desired value of s. Aggregate Number of clients
Unfortunately, the Panasas operating system only pro- Throughput
vides coarse-grained kernel timers. Consequently, sleep 10 250 1000
calls with small durations tend to be inaccurate. Using Write 370 MB/s 403 MB/s 398 MB/s
a number of micro-benchmarks, we determined that the Read 402 MB/s 483 MB/s 424 MB/s
smallest sleep interval that could be accurately imple-
Table 1: Aggregate read/write throughputs from the IOZone benchmark
mented in the PanFS operating system was 1 ms. with block size 16M. PanFS performance scales gracefully with larger
This limitation affects the way disk dilation can be im- client populations.
plemented. For I/O intensive workloads, the rate of disk
requests is high. At the same time, the service time of throughput while triangles mark write throughput. We
each request is relatively modest. In this case, delaying use solid lines for the baseline and dashed lines for the
each request individually is not an option, since the over- DieCast-scaled configuration. For both reads and writes,
head of invoking sleep dominates the injected delay and DieCast closely follows baseline performance, never di-
gives unexpectedly large slowdowns. Thus, we chose to verging by more than 5% even for unusually large block
aggregate delays across some number of requests whose sizes.
service time sums to more than 1 ms and periodically in-
ject delays rather than injecting a delay for each request. Scaling With sufficient faith in the ability of DieCast to
Another practical limitation is that it is often difficult to reproduce performance for real-world application work-
accurately bound the service time of a disk request. This loads we next aim to push the scale of the experiment
is a result of the various I/O paths that exist: requests can beyond what Panasas can easily achieve with their exist-
be synchronous or asynchronous, they can be serviced ing infrastructure.
from the cache or not and so on. We are interested in the scalability of PanFS as we
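The aggregation strategy can be sketched as follows. This is a simplified illustration under the constraints stated above (an extra delay of (t − 1)s per request, 1 ms minimum accurate sleep); the class and method names are ours, not those of the PanFS implementation.

    import time

    class AggregatedDiskDelay:
        """Sketch of aggregated delay injection for disk dilation: each request
        with physical service time s owes an extra delay of (tdf - 1) * s, but a
        sleep is only issued once at least MIN_SLEEP worth of delay has accrued."""

        MIN_SLEEP = 0.001  # 1 ms, the smallest sleep the PanFS kernel timers honor

        def __init__(self, tdf):
            self.tdf = tdf
            self.pending = 0.0  # accumulated delay owed, in seconds

        def on_request_complete(self, service_time):
            # Stretch total time toward tdf * service_time, which under dilation
            # is perceived as the undilated service time.
            self.pending += (self.tdf - 1) * service_time
            if self.pending >= self.MIN_SLEEP:
                time.sleep(self.pending)
                self.pending = 0.0

    # Example: at TDF 10, 50 us requests each owe 450 us of delay; a single
    # sleep is injected roughly every third request rather than one per request.
    d = AggregatedDiskDelay(tdf=10)
    for _ in range(6):
        d.on_request_complete(50e-6)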
While we realize that this implementation is imperfect, it works well in practice and can be automatically tuned for each workload. A perfect implementation would have to accurately model the low-level disk behavior and improve the accuracy of the kernel sleep function. Because operating systems and hardware will increasingly support native virtualization, we feel that our simple disk dilation implementation targeting individual PanFS workloads is reasonable in practice to validate our approach.

Validation. We first wish to establish DieCast accuracy by running experiments on bare hardware and comparing them against DieCast-scaled virtual machines. We start by setting up a storage system consisting of a PanFS server with 20 disks of capacity 250GB each (5TB total storage). We evaluate two benchmarks from the standard bandwidth test suite used by Panasas. The first benchmark involves 10 clients (each on a separate machine) running IOZone [23]. The second benchmark uses the Message Passing Interface (MPI) across 100 clients (again, on separate machines) [28].

For DieCast scaling, we repeat the experiment with our modifications to the PanFS server configured to enforce a dilation factor of 10. Thus, we allocate 10% of the CPU to the server and dilate the network using Dummynet to 10% of the physical bandwidth and 10 times the latency (to preserve the bandwidth-delay product). On the client side, we have all clients running in separate virtual machines (10 VMs per physical machine), each receiving 10% of the CPU with a dilation factor of 10.
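For concreteness, the sketch below derives these scaled parameters from a dilation factor of 10. The physical link characteristics used here are placeholder values for illustration only, not figures from the testbed.

    # Illustrative derivation of DieCast scaling parameters for TDF = 10.
    # The physical bandwidth/latency below are example values, not measurements.
    TDF = 10
    phys_bw_mbps = 1000.0   # assumed physical bandwidth (Mb/s)
    phys_lat_ms = 0.1       # assumed physical latency (ms)

    cpu_share = 1.0 / TDF             # 10% of the CPU per VM and for the server
    dilated_bw = phys_bw_mbps / TDF   # 10% of physical bandwidth for the Dummynet pipe
    dilated_lat = phys_lat_ms * TDF   # 10x latency, preserving bandwidth-delay product

    # The bandwidth-delay product is unchanged: (bw / t) * (lat * t) == bw * lat.
    assert abs(dilated_bw * dilated_lat - phys_bw_mbps * phys_lat_ms) < 1e-9
    print(f"CPU share {cpu_share:.0%}, bandwidth {dilated_bw:.0f} Mb/s, latency {dilated_lat:.1f} ms")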
Figure 14 plots the aggregate client throughput for both experiments on the y-axis as a function of the data block size on the x-axis. Circles mark the read throughput while triangles mark write throughput. We use solid lines for the baseline and dashed lines for the DieCast-scaled configuration. For both reads and writes, DieCast closely follows baseline performance, never diverging by more than 5% even for unusually large block sizes.

    Aggregate throughput    10 clients    250 clients    1000 clients
    Write                   370 MB/s      403 MB/s       398 MB/s
    Read                    402 MB/s      483 MB/s       424 MB/s

Table 1: Aggregate read/write throughput from the IOZone benchmark with block size 16MB. PanFS performance scales gracefully with larger client populations.

Scaling. With sufficient faith in the ability of DieCast to reproduce performance for real-world application workloads, we next aim to push the scale of the experiment beyond what Panasas can easily achieve with their existing infrastructure.

We are interested in the scalability of PanFS as we increase the number of clients by two orders of magnitude. To achieve this, we design an experiment similar to the one above, but this time we fix the block size at 16MB and vary the number of clients. We use 10 VMs each on 25 physical machines to support 250 clients running the IOZone benchmark. We further scale the experiment by using 10 VMs each on 100 physical machines to go up to 1000 clients. In each case, all VMs run at a TDF of 10. The PanFS server also runs at a TDF of 10, and all resources (CPU, network, disk) are scaled appropriately. Table 1 shows the performance of PanFS with increasing client population. Interestingly, we find relatively little increase in throughput as we increase the client population. Upon investigating further, we found that a single PanFS server configuration is limited to 4 Gb/s (500 MB/s) of aggregate bisection bandwidth between the servers and clients (including any IP and filesystem overhead). While our network emulation accurately reflected this bottleneck, we did not catch the bottleneck until we ran our experiments. We leave a performance evaluation with this bottleneck removed to future work.
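A quick unit conversion, included purely for illustration, shows how close the measured throughput in Table 1 sits to this ceiling:

    # 4 Gb/s of aggregate bisection bandwidth expressed in MB/s, compared with
    # the read throughput reported in Table 1.
    limit_mb_s = 4 * 1000 / 8  # 500 MB/s
    read_mb_s = {10: 402, 250: 483, 1000: 424}
    for clients, mbs in read_mb_s.items():
        print(f"{clients:>4} clients: {mbs} MB/s ({mbs / limit_mb_s:.0%} of the 500 MB/s ceiling)")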
We would like to emphasize that prior to our experiment, Panasas had been unable to perform experiments at this scale. This is in part due to the fact that such a large number of machines might not be available at any given time for a single experiment. Further, even if machines are available, blocking a large number of machines results in significant resource contention because several other smaller experiments are then blocked on availability of resources. Our experiments demonstrate that DieCast can leverage existing resources to work around these types of problems.

6 DieCast Usage Scenarios

In this section, we discuss DieCast's applicability and limitations for testing large-scale network services in a variety of environments.

DieCast aims to reproduce the performance of an original system configuration and is well suited for predicting the behavior of the system under a variety of workloads. Further, because the test system can be subjected to a variety of realistic and projected client access patterns, DieCast may be employed to verify that the system can maintain the terms of Service Level Agreements (SLAs).

The test system runs in a controlled and partially emulated network environment. Thus, it is relatively straightforward to consider the effects of revamping a service's network topology (e.g., to evaluate whether an upgrade can alleviate a communication bottleneck). DieCast can also systematically subject the system to failure scenarios. For example, system architects may develop a suite of faultloads to determine how well a service maintains response times, data quality, or recovery time metrics. Similarly, because DieCast controls workload generation, it is appropriate for considering a variety of attack conditions. For instance, it can be used to subject an Internet service to large-scale Denial-of-Service attacks. DieCast may enable evaluation of various DoS mitigation strategies or software architectures.

Many difficult-to-isolate bugs result from system configuration errors (e.g., at the OS, network, or application level) or inconsistencies that arise from "live upgrades" of a service. The resulting faults may only manifest as errors in a small fraction of requests, and even then only after a specific sequence of operations. Operator errors and mis-configurations [22, 24] are also known to account for a significant fraction of service failures. DieCast makes it possible to capture the effects of mis-configurations and upgrades before a service goes live.

At the same time, DieCast will not be appropriate for certain service configurations. As discussed earlier, DieCast is unable to scale down the memory or storage capacity of a service. Services that rely on multi-petabyte data sets or saturate the physical memories of all of their machines with little to no cross-machine memory/storage redundancy may not be suitable for DieCast testing. If system behavior depends heavily on the behavior of the processor cache, and if multiplexing multiple VMs onto a single physical machine results in significant cache pollution, then DieCast may under-predict the performance of certain application configurations.

DieCast may change the fine-grained timing of individual events in the test system. Hence, DieCast may not be able to reproduce certain race conditions or timing errors in the original service. Some bugs, such as memory leaks, will only manifest after running for a significant period of time. Given that we inflate the amount of time required to carry out a test, it may take too long to isolate these types of errors using DieCast.

Multiplexing multiple virtual machines onto a single physical machine, running with an emulated network, and dilating time will introduce some error into the projected behavior of target services. This error has been small for the network services and scenarios we evaluate in this paper. In general, however, DieCast's accuracy will be service- and deployment-specific. We have not yet established an overall limit to DieCast's scaling ability. In separate experiments not reported in this paper, we have successfully run with scaling factors of 100. However, in these cases, the limitation of time itself becomes significant. Waiting 10 times longer for an experiment to complete is often reasonable, but waiting 100 times longer becomes difficult.

Some services employ a variety of custom hardware, such as load balancing switches, firewalls, and storage appliances. In general, it may not be possible to scale such hardware in our test environment. Depending on the architecture of the hardware, one approach is to wrap the various operating systems for such cases in scaled virtual machines. Another approach is to run the hardware itself and to build custom wrappers that intercept requests and responses, scaling them appropriately. A final option is to run such hardware unscaled in the test environment, introducing some error in system performance. Our work with PanFS shows that it is feasible to scale unmodified services into the DieCast environment with relatively little work on the part of the developer.
7 Related Work

Our work builds upon previous efforts in a number of areas. We discuss each in turn below.

Testing scaled systems. SHRiNK [25] is perhaps most closely related to DieCast in spirit. SHRiNK aims to evaluate the behavior of faster networks by simulating slower ones. For example, their "scaling hypothesis" states that the behavior of 100Mbps flows through a 1Gbps pipe should be similar to 10Mbps flows through a 100Mbps pipe. When this scaling hypothesis holds, it becomes possible to run simulations more quickly and with a lower memory footprint. Relative to this effort, we show how to scale fully operational computer systems, considering complex interactions among CPU, network, and disk spread across many nodes and topologies.

Testing through Simulation and Emulation. One popular approach to testing complex network services is to build a simulation model of system behavior under a variety of access patterns. While such simulations are valuable, we argue that simulation is best suited to understanding coarse-grained performance characteristics of certain configurations. Simulation is less suited to catching configuration errors or to capturing the effects of unexpected component interactions, failures, etc.

Superficially, emulation techniques (e.g., Emulab [34] or ModelNet [31]) offer a more realistic alternative to simulation because they support running unmodified applications and operating systems. Unfortunately, such emulation is limited by the capacity of the available physical hardware and hence is often best suited to considering wide-area network conditions (with smaller bisection bandwidths) or smaller system configurations. For instance, multiplexing 1000 instances of an overlay across 50 physical machines interconnected by Gigabit Ethernet may be feasible when evaluating a file sharing service on clients with cable modems. However, the same 50 machines will be incapable of emulating the network or CPU characteristics of 1000 machines in a multi-tier network service consisting of dozens of racks and high-speed switches.

Time Dilation. DieCast leverages earlier work on time dilation [19] to assist with scaling the network configuration of a target service. This earlier work focused on evaluating network protocols on next-generation networking topologies, e.g., the behavior of TCP on 10Gbps Ethernet while running on 1Gbps Ethernet. Relative to this previous work, DieCast improves upon time dilation to scale down a particular network configuration. In addition, we demonstrate that it is possible to trade time for compute resources while accurately scaling CPU cycles, complex network topologies, and disk I/O. Finally, we demonstrate the efficacy of our approach end-to-end for complex, multi-tier network services.

Detecting Performance Anomalies. There have been a number of recent efforts to debug performance anomalies in network services, including Pinpoint [14], MagPie [9], and Project 5 [8]. Each of these initiatives analyzes the communication and computation across multiple tiers in modern Internet services to locate performance anomalies. These efforts are complementary to ours, as they attempt to locate problems in deployed systems. Conversely, the goal of our work is to test particular software configurations at scale to locate errors before they affect a live service.

Modeling Internet Services. Finally, there have been many efforts to model the performance of network services to, for example, dynamically provision them in response to changing request patterns [16, 30] or to reroute requests in the face of component failures [12]. Once again, these efforts typically target already running services, in contrast to our goal of testing service configurations. Alternatively, such modeling could be used to feed simulations of system behavior or to verify DieCast performance predictions at a coarse granularity.
8 Conclusion

Testing network services remains difficult because of their scale and complexity. While not technically or economically feasible, a comprehensive evaluation would require running a test system identically configured to and at the same scale as the original system. Such testing should enable finding performance anomalies, failure recovery problems, and configuration errors under a variety of workloads and failure conditions before triggering corresponding errors during live runs.

In this paper, we present a methodology and framework to enable system testing to more closely match both the configuration and scale of the original system. We show how to multiplex multiple virtual machines, each configured identically to a node in the original system, across individual physical machines. We then dilate individual machine resources, including CPU cycles, network communication characteristics, and disk I/O, to provide the illusion that each VM has as much computing power as corresponding physical nodes in the original system. By trading time for resources, we enable more realistic tests involving more hosts and more complex network topologies than would otherwise be possible on the underlying hardware. While our approach does add necessary storage and multiplexing overhead, an evaluation with a range of network services, including a commercial filesystem, demonstrates our accuracy and the potential to significantly increase the scale and realism of testing network services.

Acknowledgements

The authors would like to thank Tejasvi Aswathanarayana, Jeff Butler and Garth Gibson at Panasas for their guidance and support in porting DieCast to their systems. We would also like to thank Marvin McNett and Chris Edwards for their help in managing some of the infrastructure. Finally, we would like to thank our shepherd, Steve Gribble, and our anonymous reviewers for their time and insightful comments—they helped tremendously in improving the paper.
References

[1] BitTorrent. http://www.bittorrent.com.
[2] FreeBSD bootloader stops with BTX halted in hvm domU. http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=622.
[3] Netem. http://linux-net.osdl.org/index.php/Netem.
[4] Panasas. http://www.panasas.com.
[5] Panasas ActiveScale Storage Cluster Will Provide I/O for World's Fastest Computer. http://panasas.com/press_release_111306.html.
[6] RUBiS. http://rubis.objectweb.org.
[7] VMware appliances. http://www.vmware.com/vmtn/appliances/.
[8] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating System Principles, 2003.
[9] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modelling. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, 2004.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating System Principles, 2003.
[11] L. A. Barroso, J. Dean, and U. Holzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
[12] J. M. Blanquer, A. Batchelli, K. Schauser, and R. Wolski. Quorum: Flexible Quality of Service for Internet Services. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2005.
[13] E. Cecchet, J. Marguerite, and W. Zwaenepoel. Performance and scalability of EJB applications. In Proceedings of the 17th ACM Conference on Object-Oriented Programming, Systems, Languages and Applications, 2002.
[14] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem Determination in Large, Dynamic Internet Services. In Proceedings of the 32nd International Conference on Dependable Systems and Networks, 2002.
[15] Y.-C. Cheng, U. Hoelzle, N. Cardwell, S. Savage, and G. M. Voelker. Monkey See, Monkey Do: A Tool for TCP Tracing and Replaying. In Proceedings of the USENIX Annual Technical Conference, 2004.
[16] R. Doyle, J. Chase, O. Asad, W. Jen, and A. Vahdat. Model-Based Resource Provisioning in a Web Service Utility. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, 2003.
[17] G. R. Ganger and contributors. The DiskSim Simulation Environment. http://www.pdl.cmu.edu/DiskSim/index.html.
[18] D. Gupta, K. V. Vishwanath, and A. Vahdat. DieCast: Testing Network Services with an Accurate 1/10 Scale Model. Technical Report CS2007-0910, University of California, San Diego, 2007.
[19] D. Gupta, K. Yocum, M. McNett, A. C. Snoeren, G. M. Voelker, and A. Vahdat. To Infinity and Beyond: Time-Warped Network Emulation. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2006.
[20] A. Haeberlen, A. Mislove, and P. Druschel. Glacier: Highly durable, decentralized storage despite massive correlated failures. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2005.
[21] J. Mogul. Emergent (Mis)behavior vs. Complex Software Systems. In Proceedings of the first EuroSys Conference, 2006.
[22] K. Nagaraja, F. Oliveira, R. Bianchini, R. P. Martin, and T. D. Nguyen. Understanding and Dealing with Operator Mistakes in Internet Services. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, 2004.
[23] W. Norcott and D. Capps. IOzone Filesystem Benchmark. http://www.iozone.org/.
[24] D. Oppenheimer, A. Ganapathi, and D. Patterson. Why do Internet services fail, and what can be done about it? In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, 2003.
[25] R. Pan, B. Prabhakar, K. Psounis, and D. Wischik. SHRiNK: A Method for Scalable Performance Prediction and Efficient Network Simulation. In IEEE INFOCOM, 2003.
[26] R. Ricci, C. Alfeld, and J. Lepreau. A Solver for the Network Testbed Mapping Problem. In SIGCOMM Computer Communication Review, volume 33, 2003.
[27] L. Rizzo. Dummynet and Forward Error Correction. In Proceedings of the USENIX Annual Technical Conference, 1998.
[28] The MPI Forum. MPI: A Message Passing Interface. Pages 878–883, Nov. 1993.
[29] A. Tridgell. Emulating Netbench. http://samba.org/ftp/tridge/dbench/.
[30] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource overbooking and application profiling in shared hosting platforms. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.
[31] A. Vahdat, K. Yocum, K. Walsh, P. Mahadevan, D. Kostić, J. Chase, and D. Becker. Scalability and Accuracy in a Large-Scale Network Emulator. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.
[32] C. A. Waldspurger. Memory Resource Management in VMware ESX Server. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.
[33] A. Warfield, R. Ross, K. Fraser, C. Limpach, and S. Hand. Parallax: Managing Storage for a Million Machines. In Proceedings of the 10th Workshop on Hot Topics in Operating Systems.
[34] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An Integrated Experimental Environment for Distributed Systems and Networks. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.