A single control plane to manage one million machines. A region consists of multiple data centers, and a data center is usually divided into clusters of tens of thousands of machines connected by a high-bandwidth network. As with Borg [39] and Kubernetes [23], an isolated control plane per cluster results in stranded capacity and operational burden because workloads cannot easily move across clusters. For example, power-hungry jobs colocated in a cluster can trigger power capping [26, 41], affecting service throughput until humans move the problematic jobs to other clusters.

Similarly, large-scale hardware refresh in a cluster may result in idle machines and operational overhead. Our current hardware refresh granularity is 25% of a data center. Figure 1 shows the duration for all owners of thousands of jobs to migrate jobs out of a cluster prior to a hardware refresh in 2016. The P50 is at 7.5 days and the P100 is at 87 days. A large portion of the cluster sat idle in these ≈80 days while waiting for all jobs to migrate.

[Figure 1: CDF of time to close job-migration work tickets (% closed tickets vs. days to close work tickets for machine refresh).]

To address the problems above, we scaled a single Twine control plane to manage one million machines across all data centers in a region. Unlike Kubernetes Federation [25], Twine scales out through sharding rather than federation, dynamically moving machines in and out of entitlements instead of splitting a job across static clusters (§3.3).

[Figure 2: Replicas of data shards A-D are distributed across tasks 1-7. Tasks 1 and 3 should not be restarted concurrently for a software upgrade, as shard A would lose two replicas and become unavailable. If the machine hosting task 4 were to fail or be restarted for a kernel or firmware upgrade, the cluster management system would need to ensure that neither task 1 nor 7 is restarted concurrently in order to keep shard C available.]

Twine provides a novel TaskControl API to allow applications to collaborate with Twine in handling task lifecycle events that impact availability. For example, an application may postpone a task restart and rebuild a lost data replica first.

Host-level customization. Hardware and OS settings may significantly impact application performance. For example, our web tier achieves 11% higher throughput by tuning OS settings. Twine leverages entitlements, our quota system, to handle hardware and OS tuning. For example, an entitlement for a business unit may allow it to use up to 30,000 machines. We associate each entitlement with a host profile, a set of host customizations that the entitlement owner can tune. Out of a shared machine pool, Twine dynamically allocates machines to entitlements and switches host profiles accordingly.

Power-efficient machines. Facebook's workloads have grown faster than our data center buildup. Power scarcity motivated us to maximize performance per watt, either by employing universal stacking on big machines or deploying power-efficient small machines. We found it challenging to stack large workloads on big machines effectively. Further, unlike a public cloud that needs to support diverse customer requirements, we only need to optimize for our internal workloads. These factors led us to adopt small machines with a single CPU and 64GB RAM [32].

Shared infrastructure. As we evolved Twine to support large-scale shared infrastructure, we have been migrating our workloads onto a single shared compute pool, twshared, and a single shared storage pool. Twine supports both pools, but we focus on twshared in this paper. twshared hosts thousands of systems, including frontend, backend, ML, stream processing, and stateful services. While twshared does not host durable storage systems, it provides TBs of local flash to support stateful services that store state derived from durable storage systems. Figure 3 shows twshared's growth.

[Figure 3: Growth of twshared (twshared / total fleet). twshared was created in 2013, but adoption was limited in its first six years. We enhanced Twine and rebooted the adoption effort in 2018. twshared hosts 56% of our fleet as of October 2020, in contrast to 15% in January 2019. We expect that all compute services, ≈85% of our fleet, will run on twshared by early 2022, while the remaining 15% will run in a separate shared storage pool.]

twshared has become our ubiquitous compute pool, as all new compute capacity lands only in twshared. We had broad conversations with colleagues in industry and are unaware of any large company that has achieved near 100% shared infrastructure consolidation.

The rest of the paper is organized as follows. §2 presents the design and implementation of Twine. §3 and §4 describe how we scale Twine to manage one million machines and do so reliably. §5 evaluates Twine. §6 shares our experience with driving twshared adoption. §7 describes lessons learned. §8 summarizes related work. Finally, §9 concludes the paper.
[Figure 4: The Twine Ecosystem. Note a potential terminology confusion. The Twine scheduler corresponds to the Kubernetes [23] controllers, whereas the Twine allocator corresponds to the Kubernetes scheduler. The diagram shows the Capacity Portal, Resource Broker (RB), Ops Planner, the Health Check Service (HCS), the front end, the scheduler, the allocator, ReBalancer, Sidekick, the per-machine agent, Service Resource Manager (SRM), Conveyor, and the application-level schedulers with built-in TaskControllers (batch/ML/stream/stateful).]
2 Twine Design and Implementation

Facebook currently operates out of 12 geo-distributed regions, with several more under construction. Each region consists of multiple data center (DC) buildings. A main switchboard (MSB) [41] is the largest fault domain in a DC with sufficient power and network isolation to fail independently. A DC consists of tens of MSBs, each powering tens of rows that feed tens of racks of servers, as shown in Figure 5.

[Figure 5: Data center topology. A region contains data centers; a data center contains main switchboards (MSBs); an MSB powers rows of racks of machines.]

Historically, a cluster was a subunit within a DC consisting of about ten thousand machines connected by a high-bandwidth network and managed by an isolated Twine control plane. Over time, our network transitioned to a fabric architecture [2, 14] that provides high bandwidth both within a DC and across DCs in a region, empowering a single Twine control plane to manage jobs across DCs.

2.1 Twine Ecosystem

Figure 4 shows an overview of Twine. The Capacity Portal allows users to request or modify entitlements, which associate capacity quotas with business units defined in the service accounting hierarchy. With a granted entitlement, a user deploys jobs through the front end. The scheduler manages job and task lifecycle, e.g., orchestrating a job's software release. If a job has a TaskController, the scheduler coordinates with the TaskController to make decisions, e.g., delaying a task restart to rebuild a lost data replica first. The allocator assigns machines to entitlements and assigns tasks to machines. ReBalancer runs asynchronously and continuously to improve the allocator's decisions, e.g., better balancing the utilization of CPU, power, and network. Resource Broker (RB) stores machine information and unavailability events that track hardware failures and planned maintenance. DC operators schedule planned maintenance through Ops Planner. The Health Check Service (HCS) monitors machines and updates their status in RB. The agent runs on every machine to manage tasks. Sidekick switches host profiles as needed. Service Resource Manager (SRM) autoscales jobs in response to load changes. Conveyor is our continuous delivery system.

2.2 Entitlements

Conceptually, an entitlement is a pseudo cluster that uses a set of dynamically allocated machines to host jobs. An entitlement grants a business unit a quota expressed as a count of machines of certain types (e.g., 2,000 Skylake machines) or as Relative Resource Units (RRU) akin to ECU in AWS. A machine is either free or assigned to an entitlement, and it can be dynamically reassigned from one entitlement to another. An entitlement can consist of machines from different DCs in a region. Existing cluster management systems bind a job to a physical cluster. In contrast, Twine binds a job to an entitlement. Jobs in an entitlement stack with one another on machines assigned to the entitlement.

By default, Twine spreads tasks of the same job across DCs and MSBs as shown in Figure 6. This reduces buffer capacity
needed for fault tolerance [29]. Suppose a job's tasks are spread across 12 MSBs in one DC. We need 1/12 ≈ 8.3% of buffer capacity to guard against the failure of one MSB. If the job's tasks are spread across five DCs' 60 MSBs, the needed buffer reduces to 1/60 ≈ 1.7%. For workloads that require better locality for compute and storage, Twine allows an entitlement to override the default spread policy and pin its machines and jobs to a specific DC. These workloads are in the minority.

[Figure 6: Entitlement example. Entitlement 1 consists of machines M1, M3, and M4 from different MSBs. Jobs A and B are bound to Entitlement 1, and job C is bound to Entitlement 2. Jobs A and B stack their tasks on machine M3. As job C grows, Twine adds machine M6 to Entitlement 2.]

The allocator assigns machines to entitlements, and it also assigns tasks to machines in an entitlement. For an entitlement with a quota of N machines, the number of machines actually assigned to the entitlement may vary between 0 and N, depending on the actual needs of jobs running in the entitlement. Figure 7 depicts an example of how an entitlement changes over time.

[Figure 7: Allocation of machines and tasks. Initially, no machine is assigned to the entitlement. When job D starts, the allocator assigns machine M7 to the entitlement. When job E starts, the allocator stacks one task on M7 and adds machine M8 to the entitlement to run E's other task. When job E stops, the allocator returns M8 to the free machine pool for use by other entitlements.]

We optimized the allocator to make quick decisions when starting tasks; this optimization limits computation time and leads to best-effort outcomes. The addition or removal of machines and workload evolution may result in hotspots in CPU, power, or network. ReBalancer runs asynchronously and continuously to improve upon the allocator's allocation decisions by swapping machines across entitlements or moving tasks across machines. ReBalancer uses a constraint solver to perform these time-consuming global optimizations.

Entitlements help automate job movements across clusters. Consider a cluster-wide hardware refresh. We first add new machines from other clusters into the regional free machine pool (see the right side of Figure 7). Then the allocator moves tasks from machines undergoing hardware refresh to new machines acquired from the free machine pool, requiring no actions from the job owner. To migrate a task, Twine stops the task on the old machine and restarts it on the new machine. We do not use live container migration.

2.3 Allocator

One instance of Resource Broker (RB) is deployed to each DC. RB records whether a machine in the DC is free or assigned to an entitlement. A regional allocator fetches this information from all RBs in the same region, maintains an in-memory write-through cache, and subscribes to future changes.

The scheduler calls the allocator to perform a job allocation when a new job starts, an existing job changes size, or a machine fails. The allocation request contains an entitlement ID, an allocation policy, and a per-task map of which tasks need to be allocated or freed. The allocation policy includes hard requirements (e.g., using Skylake machines only) and soft preferences (e.g., spreading tasks across fault domains).

The allocator maintains an in-memory index of all machines and their properties to support hard requirement queries, such as "all Skylake machines with available CPU ≥ 2 RRU and available memory ≥ 5GB." It needs to search machines beyond the ones already assigned to the entitlement because it may need to add more machines to the entitlement to host the job. After applying hard requirements, it applies soft preferences to sort the remaining machines. A soft preference is expressed as a combination of 1) a machine property to partition machines into different bins with the same property value, and 2) a strategy to allocate tasks to these machine bins. For example, the allocator spreads tasks across fault domains by using a soft preference with fault domain as the machine property, and the strategy that assigns tasks evenly to the machine bins that represent fault domains.

The allocator uses multiple threads to perform concurrent allocations for different jobs, and relies on optimistic concurrency control to resolve conflicts. Before committing an allocation, a thread verifies that all impacted machines still have sufficient resources left for the allocation. If the verification fails, it retries a different allocation.

To avoid repeating the costly machine selection process, the allocator caches the allocation results at the job level. The allocator invalidates a cache entry if the job allocation request changes or the properties of the machines hosting the tasks change. The cache hit ratio is typically above 99%.

2.4 Scheduler

The scheduler manages the lifecycle of jobs and tasks. As the central orchestrator, the scheduler drives changes across
Twine components in response to different lifecycle events, including hardware failures, maintenance operations, power capping [41], kernel upgrades, job software releases, job resizing, task canary, and ReBalancer moving tasks.

The scheduler handles a machine failure as follows. When the Health Check Service detects a machine failure, it creates an unavailability event in Resource Broker, which notifies the allocator and scheduler. The scheduler disables the affected tasks in the service discovery system so that clients stop sending traffic to these tasks. A job is impacted by the machine failure if it has tasks running on the machine. If an impacted job has a TaskController, the scheduler informs the TaskController of the affected tasks. After the TaskController acknowledges that these tasks can be moved, the scheduler requests the allocator to deallocate the tasks and allocate new instances of the tasks on other machines. The scheduler instructs agents to start the new tasks accordingly. Finally, the scheduler enables the tasks in the service discovery system.

2.5 TaskControl

[…] the worst case experiences n leader failovers during a release.

We designed the TaskControl API to allow applications to collaborate with Twine when deciding which task operations to proceed and which to postpone, as depicted in Figure 8.

(a) TaskControl API:

service TaskController {
    TaskControlResponse process(TaskControlRequest request);
}
struct TaskControlRequest {
    string jobHandle;
    list<> request;           // Pending task operations to be approved.
    list<> completed;         // Completed task operations.
    list<> advanceNotices;    // Upcoming planned maintenance.
    list<> allUnhealthyTasks; // Tasks unhealthy due to any reason.
    int sequenceNumber;       // Increase after each call.
}
struct TaskControlResponse {
    list<> ack;               // Approved task operations.
}

(b) Example interaction between the scheduler and a TaskController: to update a job, the scheduler requests approval for tasks t0 and t1 (request=[t0, t1], completed=[]); the TaskController acks only t1; once t1 completes, the scheduler asks again (request=[t0], completed=[t1]) and the TaskController acks t0.

[Figure 8: The TaskControl API (a) and an example scheduler-TaskController interaction (b).]

Unlike software releases, maintenance events like a power device replacement cannot be blocked indefinitely by a TaskController; the scheduler gives the TaskController advance notices with a deadline to react. Upon reaching the deadline, the scheduler stops the remaining tasks on the affected machines, allowing maintenance to proceed. Before the deadline, a TaskController has multiple options: 1) move the tasks to other machines, 2) stop the tasks on the current machine and restart them after the maintenance completes, or 3) do nothing and keep the tasks running. For example, a top-of-rack switch maintenance typically incurs only a few minutes of network downtime, and a stateful service may prefer option 3 because rebuilding a data replica elsewhere takes longer than the maintenance itself.

2.6 Host Profiles

[Figure 9: CDF of machines used by services. A small number of services dominate the capacity consumption. Note that the x-axis is in log scale.]

Our efficiency effort focuses on these large services, and we find that host-level customization is important for maximizing their performance. For example, customizations help our large web tier achieve 11% higher throughput. However, some custom settings may be beneficial for one service but detrimental to another. As an example, a combination of explicit 2MB and 1GB hugepages improves the web tier's throughput by 4%; however, most services are incapable of utilizing explicit hugepages and enabling this setting globally would lead to unusable memory.
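To make this concrete, a host profile can be pictured as a named bundle of host-level settings that is applied as a whole when a machine joins an entitlement. The sketch below is purely illustrative and assumes hypothetical names (HostProfile, apply_host_profile, and the individual fields); it is not Twine's actual schema or Sidekick's real interface.

from dataclasses import dataclass, field

@dataclass
class HostProfile:
    # Illustrative bundle of host-level settings; field names are assumptions.
    name: str
    kernel_version: str = "default"
    sysctls: dict = field(default_factory=dict)  # e.g., hugepage and scheduler knobs
    cgroup2_cpu_controller: bool = True
    filesystem: str = "xfs"                      # e.g., "xfs" or "btrfs"
    cpu_turbo: bool = False
    hw_prefetch: bool = True

# Hypothetical profile for a service that benefits from explicit hugepages.
WEB_TIER_PROFILE = HostProfile(
    name="web-tier",
    sysctls={"vm.nr_hugepages": 4096},
    cpu_turbo=True,
)

# Hypothetical conservative profile for services that cannot use hugepages.
GENERIC_PROFILE = HostProfile(name="generic")

def apply_host_profile(machine: str, profile: HostProfile) -> None:
    # Stand-in for the reconfiguration step performed when a machine is
    # reassigned to an entitlement that carries a different host profile.
    print(f"reconfiguring {machine} with host profile '{profile.name}'")

The next paragraph describes how Twine resolves this tension in practice: each entitlement carries exactly one such profile, and Sidekick reapplies it whenever a machine changes hands.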
We resolved the conflict between host-level customization and sharing machines in a common pool via host profiles, a framework to control host-level customizations on entitlements. An entitlement is associated with one host profile; all machines in the entitlement share the same host profile. When a machine is reassigned from one entitlement to another, Sidekick automatically applies the target entitlement's host profile. By fully automating the process of machine allocation and host customization in our shared infrastructure, we can perform fleet-wide optimizations (e.g., swapping machines across entitlements to eliminate hotspots in network or power) without sacrificing workload performance. Supported host profile settings include kernel versions, sysctls (e.g., hugepages and kernel scheduler settings), cgroupv2 (e.g., CPU controller), storage (e.g., XFS or btrfs), NIC settings, CPU Turbo Boost, and hardware prefetch.

2.7 Application-Level Schedulers

As shown at the top of Figure 4, multiple application-level schedulers are built atop Twine to better support vertical workloads such as stateful [16], batch [21], machine learning [13], stream processing [28], and video processing [18]. Twine provides containers as resources for these application-level schedulers to manage and delegates task lifecycle management to them through TaskControl.

Shard Manager (SM) [16] is an example of an application-level scheduler. It is widely used at Facebook to build sharded services like the one in Figure 2. It has two major components: the SM client library and the SM scheduler. The library is linked into a sharded service and provides two APIs for the service to implement: add_shard() and drop_shard(). The SM scheduler decides the shards each Twine task will host and calls the service's add_shard() implementation to prepare the task to serve requests for those shards. To balance load, SM may migrate a shard from task T1 to task T2 by informing T1 to drop_shard() and T2 to add_shard().

The SM scheduler integrates with Twine through TaskControl and can handle the complex situations depicted in Figure 2. In another example, Twine gives SM advance notice about an upcoming maintenance on a machine. If the maintenance duration is short and the shards hosted by the machine have replicas elsewhere, SM may do nothing; otherwise, SM may migrate the impacted shards out of the machine.

2.8 Small Machines and Autoscaling

To achieve higher performance per watt, our server fleet uses millions of small machines [32], each with one 18-core CPU and 64GB RAM. We have worked with Intel to define low-power processors optimized for our environment, e.g., removing unneeded NUMA components. Four small machines are tightly packed into one sled, sharing one multi-host NIC. They are replacing our big machines, each with dual CPUs, 256GB RAM, and a dedicated NIC. Under the same rack-level power budget, a rack holds either 92 small machines or 30 big machines. A small-machine rack delivers 57% higher total compute capacity measured in RRU. Averaged across all our services, using small machines led to 18% savings in power and 17% savings in total cost of ownership (§5.4).

We are consolidating all our compute services onto small machines, as opposed to offering a variety of high-memory or high-CPU machine types. This unification simplifies downstream supply chain and fleet management. It also improves machine fungibility across services, as we can easily reuse a machine across all compute services. Our consolidation journey has been challenging (§7.4), as some services initially did not fit the limited 64GB in our small machines. To address this, we used several common software architectural changes:

• Shard a service so that each instance consumes less memory. Our Shard Manager platform (§2.7) helps developers easily build sharded services running on Twine.

• Exploit data locality to move in-memory data to an external database and use the smaller memory as a cache.

• Exploit data locality to provide tiered memory on top of 64GB RAM and TBs of local flash. For example, when migrating TAO [7], our social graph cache, from big machines to small machines, CacheLib [5] transparently provided tiered memory to improve cache hit ratio and reduce load on the external database by ≈30%.

Our largest services fully utilize small machines without stacking. We rely on Autoscaling to free up underutilized machines. Active Last Minute (ALM) is the number of people who use our online products within a one-minute interval. The load of many services correlates with ALM. Service Resource Manager (SRM) uses historical data and realtime measurements to continuously adjust task count for ALM-tracking services and frees up underutilized machines in their entirety for other workloads to use. This work has allowed us to successfully build a large-scale shared infrastructure that consists primarily of small machines.

3 Scaling to One Million Machines

We designed Twine to manage all machines that can fit in a region's 150MW power budget. Although none of our regions host one million machines yet, we are close and anticipate reaching that scale in the near future. Two principles help Twine scale to one million machines and beyond: 1) sharding as opposed to federation, and 2) separation of concerns.

3.1 Scale Out via Sharding

To scale out, we shard Twine schedulers by entitlements, as depicted in Figure 10. We assign newly-created entitlements to shards with the least load. Entitlements can change size and can migrate across shards. If a shard becomes overloaded,
Twine can transparently move an entitlement in the shard to another shard without restarting tasks in the entitlement. Twine can also migrate an individual job from one entitlement to another. To do this, Twine performs a rolling update of the job until all tasks restart on machines belonging to the new entitlement. We automate the execution of these migrations, but humans still decide when and what to migrate. Since migrations happen rarely, we do not automate these further.

[Figure 10: Scale-out via sharding. A front end routes requests for entitlements E1-E8 to their scheduler shards (Scheduler Shard 1-3), which share a regional allocator, per-DC Resource Brokers (DC1-DC4), and a shared regional free machine pool.]

[Figure 11: P99 CPU usage of production scheduler shards over one week (CPU usage in cores vs. managed machines, in thousands).]

The simplicity of scheduler sharding comes with a theoretical limitation: a single job must fit in a single scheduler shard. This is not a practical limitation. Currently, the largest scheduler shard manages ≈170K machines; the largest entitlement uses ≈60K machines; and the largest job has ≈15K tasks.

[Figure 12: P99 CPU usage of production allocators over one week (CPU usage in cores vs. managed machines, in thousands, log scale).]

Each data point in Figure 12 plots the P99 CPU utilization of a production allocator. At its peak, a large allocator performs ≈1,000 job allocations per second, with an average job size of 36 tasks. We run a few deployments of the scheduler and allocator at the global level to manage machines and jobs across multiple regions (§7.3). Our largest global allocator currently manages more than one million machines across regions. The allocator is scalable because it has a high cache hit ratio (§2.3), does not handle allocations for short-lived batch jobs (§3.2), and does not perform time-consuming optimizations (§3.2).

3.2 Separation of Concerns

[…] optimizations such as balancing CPU, network, and power.

Separation of responsibilities between Twine and application-level schedulers helps Twine scale further. Application-level schedulers handle many fine-grained resource allocation and lifecycle operations without involving Twine. For example, the Twine scheduler and allocator do not directly manage batch jobs, whose lifetime might last just a few seconds and cause high scheduling loads. The application-level batch scheduler acquires resources from Twine in the form of Twine tasks. It reuses these tasks over a long period of time to host different batch jobs, avoiding frequent host profile changes. The batch scheduler can create nested containers inside the tasks, similar to that in Mesos [17].
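To illustrate this separation of concerns, the sketch below shows one way an application-level batch scheduler could keep a pool of long-lived Twine tasks and multiplex short-lived batch jobs onto them, so that Twine never observes per-job churn. All names here (BatchExecutorPool, run_nested, and so on) are our own assumptions, not the real interfaces of Twine or Facebook's batch scheduler.

import queue

class BatchExecutorPool:
    # Illustrative pool of long-lived Twine tasks reused across many
    # short-lived batch jobs; the task handles are acquired from Twine once.
    def __init__(self, twine_task_handles):
        self.idle = queue.Queue()
        for handle in twine_task_handles:
            self.idle.put(handle)

    def run_batch_job(self, job):
        # Run one short-lived batch job inside an existing Twine task
        # (e.g., as a nested container), then return the task to the pool.
        task = self.idle.get()      # block until a task is free
        try:
            task.run_nested(job)    # hypothetical per-task entry point
        finally:
            self.idle.put(task)     # the same task hosts the next batch job

Because the Twine tasks themselves stay up for a long time, their entitlement and host profile remain stable even as thousands of batch jobs come and go.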
3.3 Comparison of Sharding and Federation

We acknowledge that Twine's scale of managing millions of machines is not unique, as Borg [39] and several public clouds likely manage infrastructure of that scale as well; however, we believe that Twine's approach is unique. Other cluster management systems scale out by deploying one isolated control plane per cluster and operate many siloed clusters. They pre-allocate machines to a cluster; once a job starts in a cluster, it stays with the cluster. This lack of mobility results in stranded capacity when some clusters are overloaded while others are idle. It also causes operational burden during cluster-wide maintenance such as hardware refresh, as shown in Figure 1.

To avoid stranded capacity, we can introduce mobility by moving either jobs or machines. To that end, the federation approach (e.g., Kubernetes Federation [25]) allows a job to be split across multiple static clusters, whereas Twine dynamically moves machines in and out of entitlements. Figure 13 compares these two approaches.

[Figure 13: The two diagrams contrast how federation and sharding support a job growing over time without stranding capacity in isolated clusters. (a) Federation approach. This approach uses a Cluster Manager per cluster and introduces an additional Federation Manager layer. Each cluster has a set of statically configured machines. As job B in Cluster 1 keeps growing, it overflows into Cluster 2. (b) Twine's sharding approach. As job B grows, Twine adds more machines to Entitlement 1, and job B stays with the same entitlement and scheduler shard.]

The federation approach can support complex multi-region, hybrid-cloud, or multi-cloud deployments, but it adds complexity as a scale-out solution. In order to provide a seamless user experience, the Federation Manager in Figure 13a has to perform complex coordination for a job whose metadata and management operations are split among multiple distributed Cluster Managers. In contrast, Twine is simpler for scaling out because a job is exclusively managed by one scheduler shard, and Resource Broker provides a simple interface to manage the shared regional pool of machines.

4 Availability and Reliability

Compared with the traditional approach of deploying one control plane per cluster, Twine's regional control plane incurs additional risks: 1) a control plane failure may impact all jobs in a region as opposed to just a cluster, and 2) network partitions may result in a regional Twine scheduler unable to manage an isolated DC.

Design principles. We observe several design principles to mitigate the risks listed above.

• All components are sharded: Each shard manages a small fraction of machines and jobs in a region, limiting the impact of a shard failure. Assuming Twine uses 20 scheduler shards to manage a 150MW region, each scheduler shard manages 7.5MW worth of machines, which is no bigger than a traditional cluster.

• All components are replicated: Consider schedulers for example: replicas of a scheduler shard sit in different DCs and elect a leader to process requests. If the leader fails or its network is partitioned from other DCs, a follower in another DC becomes the new leader.

• Tasks keep running: Even if all Twine components fail, existing tasks continue to run. New jobs cannot be created and existing tasks cannot be updated until Twine recovers. If a DC is partitioned from the scheduler, existing tasks in the DC continue to run.

• Rate-limit destructive operations: It is possible that a bug or fault might cause Twine to perform a large number of destructive operations quickly, e.g., shuffling tasks across machines at a fast pace. We protect against this failure by ensuring all components have fail-safe mechanisms to rate-limit destructive operations.

• Network redundancy: Fabric Aggregator connects our data centers in a region and can "suffer many simultaneous […]"; hence, we do not consider within-region network partitioning as a major challenge.

Operational principles. In addition to the design principles listed above, we observe several operational principles.

• Twine manages itself: To avoid developing yet another cluster management tool to manage Twine installations, all Twine components, except for the agent, run as Twine jobs. We developed automation to bootstrap the Twine ecosystem starting from scratch. The Twine agent has no dependencies on other Twine components and our bootstrapping mechanism directly sends commands to agents to start other Twine components as Twine tasks.
• Twine manages its dependencies: As we built confidence in Twine's bootstrapping automation, we ran all […]

• Gradual but frequent software release: A new release progresses gradually across regions and shards so that a bug does not hit the entire fleet instantaneously. All components are released weekly or more frequently to lower the risk associated with large changesets.

• Recurring large-scale failure test [38]: This happens regularly in production to verify Twine's reliability.

These principles help us run Twine reliably. We share one anecdote where rate-limiting mitigated the risk caused by the complex interplay of four concurrent events: 1) shifting traffic from region X to region Y, 2) performing a load test in region Y, 3) adding new server racks to region Y before removing old racks, and 4) software upgrade for the web tier. The first three events led to increased power consumption in region Y and power capping on many machines. The scheduler rate-limited the number of tasks moving away from power-capped machines. This rate-limiting halted the web tier's software upgrade and protected against further loss of capacity. In this incident, rate-limiting provided a safety net before we debugged the incident.

5 Evaluation

Our evaluation answers the following questions:

1. How does TaskControl deal with complex scenarios that impact an application's availability?

2. How effective is autoscaling for production use?

3. How effective are host profiles in improving performance? What is the overhead of switching host profiles?

4. How cost effective are small machines in replacing big machines?

5.1 TaskControl

Figure 14 demonstrates how TaskControl handles the complex situation of a software release and machine failures happening concurrently. This experiment uses a caching service managed by Shard Manager (§2.7). The cache's data are partitioned into 15,000 shards, and each shard runs three replicas. The 45,000 shard replicas are hosted by 1,000 Twine tasks. Shard Manager's TaskController helps minimize the risk of a shard losing more than one replica, i.e., driving Figure 14b's 2 replicas down curve towards zero.

[Figure 14: TaskControl helps a stateful service uphold its availability in the event of a concurrent software release and hardware failures. (a) Task restart requests over time: requested but not acked task restarts vs. just acked task restarts. (b) Shards with some replicas down over time (seconds): shards with 1 replica down vs. 2 replicas down, with the injected failure window marked as the Failure Duration.]

Let Tx denote the moment x seconds into the experiment. At T0, the user initiates a rolling update of the service. In Figure 14a, at T0, the TaskController allows 274 tasks to update concurrently (the bottom curve). It does not allow any of the other 726 tasks to update (the top curve) because that would cause some shards to lose their second replicas. In Figure 14b, at T0, 12,264 shards lose one replica (the top curve) because they are hosted by the 274 tasks undergoing update. No shard loses its second replica (the bottom curve) because of the TaskController's precise shard availability calculation.

During the Failure Duration in the figures (between T120 and T415), we inject the failure of one MSB that kills 50 tasks, causing 1,292 shards to lose their second replicas, because those shards are also hosted by the 274 tasks undergoing update. The spike in Figure 14b's bottom curve reflects the impact on the 1,292 shards.

By T240, the 274 tasks are updated and become healthy. As a result, even if the 50 tasks in the failed MSB are still down, shards with 2 replicas down drop to zero (the bottom curve in Figure 14b). At T240, the TaskController carefully selects the second batch of 214 tasks to update, ensuring no overlap between the shards hosted by the 214 tasks and the shards hosted by the 50 tasks in the failed MSB (see "Ack excludes failed shards" in Figure 14a). This careful task selection keeps Figure 14b's 2 replicas down curve at zero throughout the rest of the experiment.

5.2 Autoscaling

Currently, we autoscale ≈800 services. Figure 15 shows the efficacy of autoscaling on our web tier, which is our largest service. Autoscaling frees up to 25% of the web tier's machines during off-peak hours. The bottom curve represents the web tier's CPU utilization. The middle curve represents the web tier's real job size, i.e., the number of tasks in the job.
The top curve represents autoscaling's recommendation for the job's ideal size. The CPU utilization closely follows the recommended job size, demonstrating the prediction's accuracy. Usually, the real job size also closely follows the recommended job size, but we intentionally choose a week when they diverged during peak hours.

[Figure 15: Autoscaling the web tier over the week of March 12-17, showing the recommended job size, the real job size, and the web tier's CPU utilization (on a different y-axis). The CPU spikes are caused by the continuous-delivery process restarting tasks.]

During the week of March 12, 2020, our online products experienced a drastic traffic growth [19] related to COVID-19, causing a temporary capacity shortage. As a result, the real job size could not grow to follow the recommended job size during peak hours. The web tier's TaskController adapted to this unexpected situation without any manual intervention. During peak hours, it advanced the continuous-delivery software releases more slowly, bringing down fewer tasks concurrently to limit temporary capacity losses. During non-peak hours, it advanced software releases at a normal pace.

5.3 Host Profiles

[Figure 16: Improvement over baseline for different host profile configurations: CPU affinity, addt'l BPF cfg, and addt'l sysctl.]

[…] lowering the packet sampling rate and disabling certain packet marking. Addt'l sysctl further tunes 17 CPU scheduling and network settings, where improvements in reliability are more important than the mild performance gains. For example, based on lessons from past incidents, we tuned net.ipv4.tcp_mem to alleviate TCP's memory pressure under high loads in order to prevent cascading failures.

Overhead of switching host profiles. Figure 17 shows the host profile switching time. We discuss both ends of the performance spectrum. The P90 for enabling CPU Turbo takes 3.0 seconds. The P90 for enabling HugePages takes 244 seconds, as memory fragmentation sometimes causes the Linux kernel to fail to allocate hugepages and a machine reboot may be needed to finish the operation. To alleviate the problem, we recently developed a kernel improvement [33] that achieves above 95% success rate for hugepage allocation; we are still in the process of deploying it to production.

[Figure 17: P90 host profile switching time for different host profiles (log scale), ranging from 3.0 seconds for enabling CPU Turbo to 244 seconds for enabling HugePages.]

On average, a machine changes its host profile once every two days; hence the overall overhead is negligible. Figure 18 depicts how autoscaling impacts host profile changes.
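To put the switching overhead in perspective, a back-of-the-envelope calculation based on the numbers above (our own arithmetic, not part of the evaluation) shows that even the slowest P90 switch is a negligible fraction of machine time once amortized over the roughly two-day interval between profile changes: 244 s / (2 × 86,400 s) ≈ 0.14% of a machine's time, and only ≈0.002% for the 3.0-second CPU Turbo case.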
5.4 Small Machines

• S: A service needs S number of small machines to replace a big machine and achieve the same performance.

• S/B: Relative TCO (RTCO) of a service running on small machines vs. on big machines.

Figure 19 shows the RTCO of 22 fleet-wide representative services. One service has worse than 100% RTCO, seven use the maximum prescribed 100% RTCO, and a majority of services are able to achieve a better RTCO.

[Figure 19: The relative total cost of ownership of services running on small machines vs. on big machines. Smaller numbers mean bigger savings.]

The first service in Figure 19 achieves a low 33% RTCO by adopting Shard Manager (§2.7). The service is sharded; its biggest shard has 20x higher load than its smallest shard, and the load varies. The service's previous static-sharding solution did not work well, whereas Shard Manager is able to balance the load via shard migration. After switching to small machines, the service better utilizes the overall higher CPU count of small machines under the same TCO.

The second service achieves a 48% RTCO by moving from an in-memory data store to an external flash-based database. Its 48% RTCO includes the cost of the database, which is […]

The service with 76% RTCO is TAO [7], our social graph cache. CacheLib [5] provides transparent tiered memory on top of 64GB RAM and TBs of local flash to replace 256GB RAM (§2.8). Its 76% RTCO includes the cost of flash.

One outlier service has 110% RTCO, meaning it costs 10% more to run on small machines. The memory is used to store certain data indices and ML models that rank the indices. We are improving the service to target 90% RTCO, e.g., by leveraging CacheLib [5] to provide tiered memory.

Across all services in our fleet beyond the examples in Figure 19, we achieved an average 83% RTCO, i.e., 17% fleet-wide TCO savings. This also includes 18% power savings. Overall, we have been successful at using small machines.

6 Experience with Shared Infrastructure

As described in §1, Twine has allowed us to grow twshared, our shared compute pool, from ≈15% in 2019 to ≈56% in 2020. We share our experience with growing twshared.

6.1 Economies of Scale in twshared

Shared infrastructure provides economies of scale by reducing hardware, development, and operational costs. Examples:

• Capacity buffer consolidation. As services migrated into twshared, we consolidated siloed buffers for software releases, maintenance, fault tolerance, and growth into centralized buffers, improving utilization by ≈3%.

• Turbo Boost. We aggressively enabled Turbo on processor cores and relied on ReBalancer to mitigate power hotspots, improving utilization by ≈2% in 2020.

• Autoscaling. Autoscaling freed up over-provisioned capacity, reclaiming ≈2% of capacity in 2020.

As shown in Figure 20, as of October 2020, twshared's average memory and CPU utilization are ≈40% and ≈30%, respectively. For comparison, the figure also shows utilization for private pools, our legacy pools of customized machines dedicated to individual workloads. We plan to improve utilization through multiple approaches, such as the one described below. Our fleet is dominated by user-facing services that provision capacity for peak load. Autoscaling frees some of this over-provisioned capacity during off-peak hours and provides it as opportunistic capacity for other workloads to use. Unfortunately, we do not yet provide service-level objectives (SLOs) on the availability of opportunistic capacity, which is limiting adoption and usage of all available capacity. As we establish SLOs for opportunistic capacity, improve stacking, and consolidate capacity buffers, we expect twshared's utilization to increase.

[Figure 20: Daily average CPU and memory utilization of twshared and private pools circa October 2020.]

6.2 Path to Shared Infrastructure

We had broad conversations with colleagues in industry and learned that while partial consolidation of workloads is common, no large company has achieved near 100% shared infrastructure consolidation. Further, we learned that cultural challenges are as significant as technical challenges. Below, we describe our strategy and major milestones towards migrating all non-storage workloads into twshared.

Make Twine capable of supporting a large shared pool. Scalability, entitlements, host profiles, and TaskControl are Twine's important features that enabled workload consolidation. The flexibility offered by host profiles and TaskControl ensures that twshared can support both 1) the general needs of thousands of services, and 2) the specialized needs of a smaller set of services that consume the majority of capacity.
[Figure 21: Breakdown of machines in our fleet as of August 2018, across four categories: spare (8%), private pools not managed by Twine (14%), private pools managed by Twine (65%), and twshared (13%). Each small rectangle inside a category represents a private pool, and its size is proportional to the number of machines in the private pool. There were hundreds of private pools, many of which were small in size. The percentages at the top reflect the number of machines in each category relative to all machines globally. From August 2018 to October 2020, the breakdown evolved from [8%, 14%, 65%, 13%] to [13%, 5%, 26%, 56%], where the numbers match the left-to-right categories in the figure.]

Publicize the growth and health of twshared. We developed a tool to show the realtime breakdown of our fleet and the growth of twshared. A snapshot is shown in Figure 21. We consolidated the fragmented mechanisms of measuring machine health into the Health Check Service. Continuous improvements have resulted in twshared running healthier than private pools, 99.6% vs. 98.3%.

Set a strong example for others to follow. Early on, we targeted the web tier, our largest private pool. It directly serves external users of our company's products and any outage would be immediately noticeable. We finished migrating the web tier into twshared in mid-2019. As the web tier team is highly respected in the company, their testimony motivated others to follow.

Make migration mandatory. After the web tier migration, we gained company-wide support for mandatory migration. Further, we established that all new compute capacity will land only in twshared. This mandate, along with Twine's flexibility of supporting customization through TaskControl and host profiles, has made twshared our ubiquitous compute pool.

6.3 Case Study of twshared Migration

PGx is a large product group that runs hundreds of diverse services on hundreds of thousands of machines. Their services vary in size from a few machines to tens of thousands, and in complexity from computationally intensive ML training to latency-sensitive ad delivery. Previously, their fleet was fragmented into tens of private pools per region. The first PGx service migrated into twshared in January 2020; as of September 2020, more than 70% of PGx machines run in twshared. Given the size and diversity of their services, we expect the migration to finish in late 2021.

PGx services use hundreds of twshared entitlements; if a service runs in multiple regions, it needs one entitlement per region. Figure 22 shows the size distribution, with the biggest entitlement running ≈2K jobs on ≈15K machines.

[Figure 22: CDF of PGx entitlement size (% of machines vs. number of entitlements for PGx services). The distribution is highly skewed. The largest 54 entitlements account for 70% of PGx capacity in twshared.]

Accommodating workload-specific requirements helps onboard PGx services onto twshared. For instance, many PGx services run A/B tests in production, e.g., to evaluate the effectiveness of a new model; these services need to explicitly configure the processor generation for their tasks to prevent performance variations between hardware types from polluting their test results.

The capacity guaranteed by entitlements and private pools accounts for 55% of PGx machines. The remaining 45% are from opportunistic sources, including capacity buffers, machines freed up by autoscaling, and unused portions of other teams' entitlements. Optimus is an application-level scheduler that runs atop Twine to manage opportunistic capacity. When opportunistic capacity is not available, some services gracefully degrade their quality of service.

Jobs with a TaskController consume 36% of PGx capacity in twshared; in total these jobs use three different TaskControllers, including the one from Shard Manager [16]. About 95% of PGx capacity is consumed by entitlements that use some combination of these three host profile settings:

1. If a service does frequent flash writes, it prefers the flash drive to expose only a fraction of the flash capacity in order to reduce write amplification and burn rate.

2. If a service can fully utilize a whole machine and does not stack with other services, we disable the cgroup2 CPU controller to eliminate its overhead.

3. Because our data centers are power constrained and CPU Turbo consumes extra power, we enable Turbo only for services that can benefit significantly from Turbo and are running in selected data centers with sufficient power.

Overall, our experience with PGx indicates that, despite the significant upfront effort needed for migration, even large and varied services are motivated to adopt shared infrastructure that reduces their operational burden. PGx's success in using opportunistic capacity at a large scale has spurred us to develop SLO guarantees and drive broader adoption (§6.1). Entitlements, TaskControl, and host profiles enable customization in a shared pool and were the features that enabled the migration. On the other hand, PGx services have grown to hundreds of entitlements within 9 months, motivating us to address entitlement fragmentation (§7.1).
7 Lessons Learned

Evolving Twine and growing twshared have taught us several lessons. We share some highlights and lowlights below.

7.1 Entitlement Fragmentation

We overloaded entitlements with two responsibilities: fleet partitioning and quota management. Entitlements partition millions of machines into smaller units that can be effectively managed by scheduler shards. Twine jobs can only stack within the same entitlement, implying that an entitlement should be sized at a few thousand machines, similar to a Borg [39] cell.

On the other hand, leveraging entitlements for quota management results in small entitlements. For example, an important service may wish for an entitlement with 10 tasks rather than a larger entitlement shared with other services to protect against the risk that a rogue service grows unexpectedly and uses up the entitlement quota.

We are in the process of splitting an entitlement's responsibility into two new abstractions: a materialization for fleet partitioning and a stackable reservation for quota management. A materialization functions as a pseudo cluster, has a host profile associated with it, and is always large enough to enable job stacking across thousands of machines.

7.2 Controlled Customization

Our goal is ubiquitous shared infrastructure. A difficult lesson we learned from the first six years of operating twshared was that customization is key to migrating services over. For instance, without host profiles, our web tier and memcache services would not run in twshared, as their performance would regress by 11% and 10.2%, respectively. TaskControl has provided a path for stateful services such as TAO [7] and MySQL to deprecate their custom cluster management tooling and adopt Twine and shared infrastructure.

We prioritize maintainability when deciding what customization to permit. Currently, we offer 17 host profiles and 16 TaskControllers to support thousands of services. Our recent migration of ≈70% of a large product group's services into twshared (§6.3) leveraged existing host profiles and TaskControllers.

In hindsight, we permitted some customizations that appeared useful initially, but later became barriers for fleet-wide optimizations. For example, a job's tasks are identical by default, but we provided the ability to customize individual tasks, including the executables to run, command line options, environment variables, and restart policies. Developers abused this customization to implement simple sharding so that each task does different work. Autoscaling changes the number of tasks in a job and breaks the job's task customization. As we enable autoscaling for all ALM-tracking services, we are removing task customization and migrating these services to use Shard Manager [16] instead.

7.3 Supporting Global Services

Many developers wish to run a global service without worrying about operational challenges: which regions to deploy to, how much capacity is needed in each region, and how to handle regional failures. We currently operate multiple global Twine deployments that spread a global job's tasks across regions, similar to how a regional Twine deployment spreads a regional job's tasks across data centers in a region. Currently, global jobs account for 8% of all our jobs.

We have learned over time that global Twine deployments did not provide the right abstraction for managing global services. Machines in a region are largely fungible due to the high network bandwidth and low latency within a region, but this is not true for machines distributed across regions. Hence, it is better to explicitly decompose a service's global capacity needs into capacity needs for specific regions, as opposed to global allocators making ad hoc decisions on which regions to get machines from. We are replacing global Twine deployments with a new Federation system built atop regional Twine deployments to provide stronger capacity guarantees and more holistic support for a global-service abstraction.

7.4 Challenges with Small Machines

Our decision to leverage small machines brings with it numerous trade-offs. The effort to rearchitect and reimplement memory-capacity-bound services was higher than we anticipated. On the other hand, we leveraged this opportunity to holistically modernize our legacy services, e.g., moving from static sharding to dynamic sharding for better load balancing. Because small machines run contrary to the industry practice of favoring big machines, we need to work closely with hardware vendors to optimize machines for our internal workloads, e.g., removing unneeded NUMA components.

That said, the 18% power efficiency win (§5.4) from small machines has been worth the above trade-offs. We intend to continue using small machines in the coming years, but are also prepared to evolve our hardware strategy as needed. Two factors led to our decision to adopt small machines: 1) our legacy large services were optimized for utilizing entire machines running in private pools, and 2) our stacking technology needed to mature and improve support for performance isolation [42]. As our services undergo architectural changes to run effectively in twshared, and we improve our stacking technology, we may revisit our hardware strategy.

8 Related Work

Scalability and scheduling performance. Kubernetes [25] and Hydra [9] scale out through federation, whereas Twine scales out through sharding. Figure 13 compares the two approaches. A large body of work [6, 15, 20, 31] focuses on improving batch scheduling throughput and latency. Twine
delegates the handling of short-lived batch jobs to application- Some systems statically partition machines in a cluster and
level batch schedulers. This separation of concerns helps preconfigure their hardware and OS settings to suit different
Twine scale, as discussed in §3.2. workloads. Others dynamically adjust predetermined settings
(e.g., Turbo [40]) based on runtime profiling, while disallow-
Entitlements. Twine has some similarity to the two-level
ing other customizations (e.g., btrfs vs. ext4). We believe
schedulers (Mesos [17], YARN [37], Apollo [6], and
that Twine is the first system that 1) allows workloads to pro-
Fuxi [44]), with Twine entitlements as resource offers and
vide customized hardware and OS settings to run in a shared
Twine scheduler shards as Application Masters (or frame-
machine pool and 2) dynamically reconfigures a machine just-
works in Mesos). However, the bottom-level Resource Man-
in-time as the workload is scheduled onto the machine. On
ager (or Master in Mesos) is designed for the scale of a single
average, Twine reconfigures a machine once every two days,
cluster. In contrast to the single-master two-level architec-
primarily due to Autoscaling (see Figure 18).
ture, we propose a three-level architecture with sharding so
our design scales out: Resource Broker manages machines, Power-efficient hardware. A large body of work studies
Twine scheduler manages containers, and Application-level power-efficient computing [1, 10, 27]. Our infrastructure is
schedulers manage workloads such as batch and ML. unique in 1) using power-efficient small machines as a uni-
Kubernetes’ cluster autoscaler [24] can respond to work- versal computing platform, and 2) consolidating towards a
load growth by provisioning VMs in a public cloud and adding single compute machine type (one CPU and 64GB RAM), as
them to a node pool. Kubernetes’ resizable node pool cor- opposed to offering a variety of high-memory or high-CPU
responds to Twine’s entitlement, and a public cloud’s avail- machine types. Both approaches required our workloads to
able resources correspond to Twine’s shared free machine make software architectural changes that would be challeng-
pool maintained by Resource Broker. Decoupling Kuber- ing in a public cloud with external customer workloads.
netes and cloud makes the setup flexible, but also misses
optimization opportunities compared with Twine’s integrated Overcommitment and autoscaling. Past work overcommits
ecosystem. Multiple Kubernetes clusters run independently CPU and memory by colocating batch jobs and online ser-
without coordination, whereas Twine’s ReBalancer performs vices [11, 22, 39, 43]. Twine does not overcommit CPU or
global optimization across entitlements, and an entitlement memory by default, although a job owner can explicitly con-
can be migrated across scheduler shards. figure their job to do so. On the other hand, we overcommit
Kubernetes [23]’s custom controllers provide a universal extension framework that can be used to implement various custom functions, such as autoscaling and injecting sidecars for traffic routing. In contrast, TaskControl focuses exclusively on allowing or delaying task lifecycle operations. This narrow interface strikes a balance between standardization and customization (§7.2) and prevents a proliferation of customizations across all aspects of the Twine control plane. We are unaware of any Kubernetes custom controller that specifically offers extension points to allow or delay task lifecycle operations.

Azure supports update domains and fault domains [3], and the example stateful service in Figure 2 can improve availability by spreading its data shards’ replicas across those domains. However, in the event of a machine failure, Azure may still proceed with a rolling update that can lead to unavailable shards, because it does not know precisely how the shard replicas are spread across fault domains and update domains.
Host profiles. Paragon [12] schedules a job on machines that are beneficial to the job’s performance, but it does not reconfigure a machine. Some systems statically partition machines in a cluster and preconfigure their hardware and OS settings to suit different workloads. Others dynamically adjust predetermined settings (e.g., Turbo [40]) based on runtime profiling, while disallowing other customizations (e.g., btrfs vs. ext4). We believe that Twine is the first system that 1) allows workloads to provide customized hardware and OS settings to run in a shared machine pool and 2) dynamically reconfigures a machine just-in-time as the workload is scheduled onto the machine. On average, Twine reconfigures a machine once every two days, primarily due to autoscaling (see Figure 18).
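A host profile can be pictured as a small declarative bundle of the knobs an entitlement owner may tune. The sketch below is an assumption: the specific settings (hugepages, turbo boost, file system) merely echo the examples cited in this section, and switch_profile is an invented helper that computes what must change when a machine is reallocated to an entitlement, rather than Twine’s actual configuration format.

# Hypothetical host profiles (the settings and helper are illustrative).
DEFAULT_PROFILE  = {"hugepages": "off", "turbo_boost": "on", "filesystem": "ext4"}
WEB_TIER_PROFILE = {"hugepages": "1GB", "turbo_boost": "on", "filesystem": "btrfs"}

def switch_profile(machine, current, target):
    # Return only the settings that actually need to change on this machine;
    # a real system would apply them (and possibly reboot) just-in-time,
    # right before the entitlement's tasks land on the host.
    return {key: value for key, value in target.items() if current.get(key) != value}

print(switch_profile("host123", DEFAULT_PROFILE, WEB_TIER_PROFILE))
# -> {'hugepages': '1GB', 'filesystem': 'btrfs'}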
Power-efficient hardware. A large body of work studies power-efficient computing [1, 10, 27]. Our infrastructure is unique in 1) using power-efficient small machines as a universal computing platform, and 2) consolidating towards a single compute machine type (one CPU and 64GB RAM), as opposed to offering a variety of high-memory or high-CPU machine types. Both approaches required our workloads to make software architectural changes that would be challenging in a public cloud with external customer workloads.

Overcommitment and autoscaling. Past work overcommits CPU and memory by colocating batch jobs and online services [11, 22, 39, 43]. Twine does not overcommit CPU or memory by default, although a job owner can explicitly configure their job to do so. On the other hand, we overcommit power by default [41], as power is our most constrained resource. Twine helps mitigate power hotspots by relocating tasks across data centers. Twine’s SRM uses historical data to predictively adjust the number of tasks in a job. Borg’s Autopilot [34] adjusts the CPU and memory allocated to each task; this is an area of future work for Twine.
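The history-driven flavor of this adjustment can be shown with a toy predictor. SRM’s actual model is not described at this level of detail, so the function, its inputs, and the numbers below are all assumptions: the job is sized from the worst recent peak for the same hour of day, plus headroom, rather than from the instantaneous load alone.

import math

def predict_task_count(history, hour, per_task_capacity, headroom=1.2):
    # history: one dict per past day, mapping hour -> peak requests/sec.
    peaks = [day[hour] for day in history if hour in day]
    expected_peak = max(peaks)               # conservative: worst recent day
    return math.ceil(expected_peak * headroom / per_task_capacity)

history = [
    {9: 90_000, 18: 140_000},   # made-up hourly peak RPS for two past days
    {9: 110_000, 18: 150_000},
]
print(predict_task_count(history, hour=18, per_task_capacity=500))  # -> 360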
9 Conclusion

We identify existing cluster management systems’ limitations in supporting large-scale shared infrastructure. We describe our novel solution that allowed us to scale Twine to manage one million machines in a region, move jobs across physical clusters, collaborate with applications to manage their lifecycle, support host customization in a shared pool, use power-efficient small machines to achieve higher performance per watt, and employ autoscaling to improve machine utilization. We share our experience with twshared and our strategy towards ubiquitous shared infrastructure.

Acknowledgments

This paper presents the engineering work of several teams at Facebook that have built Twine and its ecosystem over the past decade. We thank Niket Agarwal, Marius Eriksen, Tianyin Xu, Murray Stokely, Seth Hettich, and the OSDI reviewers for their insightful feedback.
References

[1] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, 2009.

[2] Alexey Andreyev. Introducing data center fabric, the next-generation Facebook data center network, 2014. https://engineering.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/.

[3] Azure update domain and fault domain, 2019. https://docs.microsoft.com/en-us/azure/virtual-machines/availability.

[4] Mahesh Balakrishnan, Jason Flinn, Chen Shen, Mihir Dharamshi, Ahmed Jafri, Xiao Shi, Santosh Ghosh, Hazem Hassan, Aaryaman Sagar, Rhed Shi, Jingming Liu, Filip Gruszczynski, Xianan Zhang, Huy Hoang, Ahmed Yossef, Francois Richard, and Yee Jiun Song. Virtual Consensus in Delos. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, 2020.

[5] Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. The CacheLib Caching Engine: Design and Experiences at Scale. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, 2020.

[6] Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, 2014.

[7] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. TAO: Facebook’s Distributed Data Store for the Social Graph. In Proceedings of the 2013 USENIX Annual Technical Conference, 2013.

[8] Christopher Bunn. Containerizing ZooKeeper with Twine: Powering container orchestration from within, 2020. Facebook blog post. https://engineering.fb.com/developer-tools/zookeeper-twine/.

[9] Carlo Curino, Subru Krishnan, Konstantinos Karanasos, Sriram Rao, Giovanni M. Fumarola, Botong Huang, Kishore Chaliparambil, Arun Suresh, Young Chen, Solom Heddaya, Roni Burd, Sarvesh Sakalanaga, Chris Douglas, Bill Ramsey, and Raghu Ramakrishnan. Hydra: a federated resource manager for data-center scale analytics. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, 2019.

[10] Adrian M. Caulfield, Laura M. Grupp, and Steven Swanson. Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-Intensive Applications. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009.

[11] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In Proceedings of the 26th ACM Symposium on Operating Systems Principles, 2017.

[12] Christina Delimitrou and Christos Kozyrakis. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, 2013.

[13] Jeffrey Dunn. Introducing FBLearner Flow: Facebook’s AI backbone, 2016. https://engineering.fb.com/ml-applications/introducing-fblearner-flow-facebook-s-ai-backbone/.

[14] João Ferreira, Naader Hasani, Sreedhevi Sankar, Jimmy Williams, and Nina Schiff. Fabric Aggregator: A flexible solution to our traffic demand, 2014. Facebook blog post. https://engineering.fb.com/data-center-engineering/fabric-aggregator-a-flexible-solution-to-our-traffic-demand/.

[15] Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, and Steven Hand. Firmament: Fast, Centralized Cluster Scheduling at Scale. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.

[16] Gerald Guo and Thawan Kooburat. Scaling services with Shard Manager, 2020. Facebook blog post. https://engineering.fb.com/production-engineering/scaling-services-with-shard-manager/.

[17] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, and Ion Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, 2011.

[18] Qi Huang, Petchean Ang, Peter Knowles, Tomasz Nykiel, Iaroslav Tverdokhlib, Amit Yajurvedi, Paul Dapolito IV, Xifan Yan, Maxim Bykov, Chuen Liang, Mohit Talwar, Abhishek Mathur, Sachin Kulkarni, Matthew Burke, and Wyatt Lloyd. SVE: Distributed Video Processing at Facebook Scale. In Proceedings of the 26th ACM Symposium on Operating Systems Principles, 2017.

[19] Mike Isaac and Sheera Frenkel. Facebook Is ‘Just Trying to Keep the Lights On’ as Traffic Soars in Pandemic. The New York Times, 2020. https://www.nytimes.com/2020/03/24/technology/virus-facebook-usage-traffic.html.

[20] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, 2009.

[21] Rui Jian and Hao Lin. Tangram: Distributed Scheduling Framework for Apache Spark at Facebook, 2019. https://databricks.com/session/tangram-distributed-scheduling-framework-for-apache-spark-at-facebook.

[22] Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov, Íñigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao. Morpheus: Towards Automated SLOs for Enterprise Clusters. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.

[23] Kubernetes, 2020. https://kubernetes.io/.

[24] Kubernetes cluster autoscaler, 2020. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler.

[25] Kubernetes Federation, 2020. https://github.com/kubernetes/community/tree/master/sig-multicluster.

[26] Shaohong Li, Xi Wang, Xiao Zhang, Vasileios Kontorinis, Sree Kodakara, David Lo, and Partha Ranganathan. Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, 2020.

[27] Toni Mastelic, Ariel Oleksiak, Holger Claussen, Ivona Brandic, Jean-Marc Pierson, and Athanasios V. Vasilakos. Cloud Computing: Survey on Energy Efficiency. ACM Computing Surveys (CSUR), 47(2):1–36, 2014.

[28] Yuan Mei, Luwei Cheng, Vanish Talwar, Michael Y. Levin, Gabriela Jacques-Silva, Nikhil Simha, Anirban Banerjee, Brian Smith, Tim Williamson, Serhat Yilmaz, Weitao Chen, and Guoqiang Jerry Chen. Turbine: Facebook’s Service Management Platform for Stream Processing. In Proceedings of the 36th IEEE International Conference on Data Engineering, 2020.

[29] Aravind Narayanan, Elisa Shibley, and Mayank Pundir. Fault tolerance through optimal workload placement, 2020. Facebook blog post. https://engineering.fb.com/data-center-engineering/fault-tolerance-through-optimal-workload-placement/.

[30] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, 2013.

[31] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, 2013.

[32] Vijay Rao and Edwin Smith. Facebook’s new server design delivers on performance without sucking up power, 2016. https://engineering.fb.com/data-center-engineering/facebook-s-new-front-end-server-design-delivers-on-performance-without-sucking-up-power/.

[33] Roman Gushchin. Hugetlb: optionally allocate gigantic hugepages using cma, 2020. https://lkml.org/lkml/2020/3/9/1135.

[34] Krzysztof Rzadca, Pawel Findeisen, Jacek Swiderski, Przemyslaw Zych, Przemyslaw Broniek, Jarek Kusmierek, Pawel Nowak, Beata Strack, Piotr Witusowski, Steven Hand, and John Wilkes. Autopilot: workload autoscaling at Google. In Proceedings of the 15th ACM European Conference on Computer Systems, 2020.

[35] Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. Holistic Configuration Management at Facebook. In Proceedings of the 25th ACM Symposium on Operating Systems Principles, 2015.

[36] Muhammad Tirmazi, Adam Barker, Nan Deng, Md Ehtesam Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. Borg: the Next Generation. In Proceedings of the 15th ACM European Conference on Computer Systems, 2020.

[37] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, 2013.

[38] Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Shruti Padmanabha, Ashish Shah, et al. Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.

[39] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the 10th ACM European Conference on Computer Systems, 2015.

[40] Jons-Tobias Wamhoff, Stephan Diestelhorst, Christof Fetzer, Patrick Marlier, Pascal Felber, and Dave Dice. The TURBO Diaries: Application-controlled Frequency Scaling Explained. In Proceedings of the 2014 USENIX Annual Technical Conference, 2014.

[41] Qiang Wu, Qingyuan Deng, Lakshmi Ganesh, Chang-Hong Hsu, Yun Jin, Sanjeev Kumar, Bin Li, Justin Meza, and Yee Jiun Song. Dynamo: Facebook’s Data Center-Wide Power Management System. ACM SIGARCH Computer Architecture News, 44(3), 2016.

[42] Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. CPI2: CPU performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, 2013.

[43] Yunqi Zhang, George Prekas, Giovanni Matteo Fumarola, Marcus Fontoura, Íñigo Goiri, and Ricardo Bianchini. History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.

[44] Zhuo Zhang, Chao Li, Yangyu Tao, Renyu Yang, Hong Tang, and Jie Xu. Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale. Proceedings of the VLDB Endowment, 7(13):1393–1404, 2014.