Storage Tiering and Erasure Coding in Ceph - 150222
2
ARCHITECTURE
CEPH MOTIVATING PRINCIPLES
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
4
CEPH COMPONENTS
LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS
A software-based, reliable, autonomous, distributed object store comprised of
self-healing, self-managing, intelligent storage nodes and lightweight monitors
5
ROBUST SERVICES BUILT ON RADOS
ARCHITECTURAL COMPONENTS
7
THE RADOS GATEWAY
[Diagram: applications issue REST requests to RADOSGW instances, which use LIBRADOS over a socket to reach the RADOS cluster and its monitors (M)]
8
MULTI-SITE OBJECT STORAGE
[Diagram: web applications and app servers at multiple sites sharing the object store]
9
RADOSGW MAKES RADOS WEBBY
RADOSGW:
● REST-based object storage proxy
● Uses RADOS to store objects
– Stripes large RESTful objects across many RADOS objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
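Because RADOSGW exposes an S3-compatible API, a stock S3 client can exercise it. Below is a minimal sketch using the boto library; the endpoint, credentials, and bucket name are placeholders for whatever your gateway is configured with.

    # Sketch: talking to RADOSGW through its S3-compatible API with boto.
    # Host, credentials, and bucket name are placeholders.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',          # RADOSGW endpoint
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello world')   # RGW stripes this over RADOS objects
    print(key.get_contents_as_string())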
11
ARCHITECTURAL COMPONENTS
12
STORING VIRTUAL DISKS
[Diagram: a VM's virtual disk is served by the hypervisor through LIBRBD, which talks to the RADOS cluster]
13
KERNEL MODULE
[Diagram: a Linux host maps an RBD image through the KRBD kernel module, which talks directly to the RADOS cluster]
14
RBD FEATURES
● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration
– Qemu
– Linux kernel
– iSCSI (STGT, LIO)
– OpenStack, CloudStack, Nebula, Ganeti, Proxmox
● Incremental backup (relative to snapshots)
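As an illustration of the snapshot and copy-on-write clone features, here is a hedged sketch using the python-rados and python-rbd bindings; pool, image, and snapshot names are made up, and a real deployment may require different image features.

    # Sketch: create an RBD image, snapshot it, and make a copy-on-write clone.
    # Pool, image, and snapshot names are placeholders.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                 # pool holding the images

    r = rbd.RBD()
    r.create(ioctx, 'base-image', 10 * 1024**3,       # 10 GiB image
             old_format=False, features=rbd.RBD_FEATURE_LAYERING)

    with rbd.Image(ioctx, 'base-image') as img:
        img.create_snap('gold')                       # read-only snapshot
        img.protect_snap('gold')                      # required before cloning

    r.clone(ioctx, 'base-image', 'gold', ioctx, 'vm-disk-1',
            features=rbd.RBD_FEATURE_LAYERING)        # copy-on-write clone

    ioctx.close()
    cluster.shutdown()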
15
ARCHITECTURAL COMPONENTS
16
SEPARATE METADATA SERVER
[Diagram: the kernel client on a Linux host sends metadata operations to the metadata server and file data directly to the RADOS cluster]
17
SCALABLE METADATA SERVERS
METADATA SERVER
● Manages metadata for a POSIX-compliant shared filesystem
– Directory hierarchy
– File metadata (owner, timestamps, mode, etc.)
● Snapshots on any directory
● Clients stripe file data in RADOS
– MDS not in data path
● MDS stores metadata in RADOS
● Dynamic MDS cluster scales to 10s or 100s
● Only required for shared filesystem
18
RADOS
ARCHITECTURAL COMPONENTS
20
RADOS
● Flat object namespace within each pool
● Rich object API (librados)
– Bytes, attributes, key/value data
– Partial overwrite of existing data (mutable objects)
– Single-object compound operations
– RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
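A small, hedged sketch of that object API through the librados Python binding; the pool and object names are placeholders, and the key/value (omap) and compound-operation calls are omitted for brevity.

    # Sketch: bytes, partial overwrite, and attributes via the librados Python binding.
    # Pool and object names are placeholders.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')

    ioctx.write_full('greeting', b'hello')         # object data (bytes)
    ioctx.write('greeting', b'HE', 0)              # partial overwrite at offset 0
    ioctx.set_xattr('greeting', 'lang', b'en')     # per-object attribute
    print(ioctx.read('greeting'))                  # b'HEllo'
    print(ioctx.get_xattr('greeting', 'lang'))     # b'en'

    ioctx.close()
    cluster.shutdown()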
21
RADOS CLUSTER
[Diagram: an application talks directly to the OSDs and monitors (M) of a RADOS cluster]
22
RADOS COMPONENTS
OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
23
OBJECT STORAGE DAEMONS
[Diagram: each OSD stores its objects through a local filesystem (FS) on its disk]
24
DATA PLACEMENT
WHERE DO OBJECTS LIVE?
[Diagram: an application holds an object; which node in the cluster should store it?]
26
A METADATA SERVER?
[Diagram: option 1: (1) ask a central metadata server where the object lives, then (2) contact that node]
27
CALCULATED PLACEMENT
[Diagram: option 2: compute the location from the object name itself (e.g., name ranges A-G, H-N, O-T, U-Z each map to a server); object "F" goes straight to the A-G server]
28
CRUSH
[Diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the RADOS cluster]
30
CRUSH AVOIDS FAILED DEVICES
[Diagram: when an OSD fails, CRUSH maps the object's PG onto a different, healthy OSD]
31
CRUSH: DECLUSTERED PLACEMENT
● Each PG independently maps to a pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored is re-replicated by a different set of surviving OSDs
– Highly parallel recovery
– Avoid single-disk recovery bottleneck
[Diagram: a failed OSD's PGs recover in parallel across the RADOS cluster]
32
CRUSH: DYNAMIC DATA PLACEMENT
CRUSH:
● Pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighted devices (different sizes)
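To illustrate the flavor of the properties above (illustration only: this is not the CRUSH algorithm, which also walks a hierarchical, rule-driven cluster map), here is a toy weighted rendezvous-hash placement that any client can compute locally with no lookup table.

    import hashlib

    def place(obj_name, osds, replicas=3):
        """Toy deterministic, weighted placement (rendezvous hashing).
        Illustrates 'fast calculation, no lookup, repeatable, weighted';
        it is NOT the real CRUSH algorithm."""
        def score(osd_id, weight):
            h = hashlib.sha1(('%s:%s' % (obj_name, osd_id)).encode()).hexdigest()
            u = int(h, 16) / float(16 ** 40)        # pseudorandom, uniform in [0, 1)
            return u ** (1.0 / weight)              # heavier devices win more often
        ranked = sorted(osds, key=lambda o: score(*o), reverse=True)
        return [osd_id for osd_id, _ in ranked[:replicas]]

    osds = [('osd.0', 1.0), ('osd.1', 1.0), ('osd.2', 2.0), ('osd.3', 1.0)]
    print(place('rbd_data.1234.0000000000000000', osds))  # same result on every client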
33
DATA IS ORGANIZED INTO POOLS
[Diagram: objects in pools A, B, C, and D hash into each pool's own placement groups (PGs), which are distributed across the cluster]
34
TIERED STORAGE
TWO WAYS TO CACHE
● Within each OSD
– Combine SSD and HDD under each OSD
– Make localized promote/demote decisions
– Leverage existing tools
● dm-cache, bcache, FlashCache
● Variety of caching controllers
– We can help with hints
● Cache on separate devices/nodes
– Different hardware for different tiers
● Slow nodes for cold data
● High performance nodes for hot data
– Add, remove, scale each tier independently
● Unlikely to choose right ratios at procurement time
[Diagram: a single OSD layering its filesystem over a block device that combines an HDD with an SSD cache]
36
TIERED STORAGE
[Diagram: application I/O goes to a cache tier layered in front of a base storage tier]
37
RADOS TIERING PRINCIPLES
● Each tier is a RADOS pool
– May be replicated or erasure coded
● Tiers are durable
– e.g., replicate across SSDs in multiple hosts
● Each tier has its own CRUSH policy
– e.g., map the cache pool to SSD devices/hosts only
● librados adapts to the tiering topology
– Transparently directs requests accordingly (e.g., to the cache)
– No changes to RBD, RGW, CephFS, etc.
38
READ (CACHE HIT)
[Diagram: the client's read is served directly from the cache tier]
39
READ (CACHE MISS)
[Diagram: on a miss, the cache tier proxies the read to the base tier and returns the data]
40
READ (CACHE MISS)
[Diagram: alternatively, the object is promoted from the base tier into the cache and then served]
42
WRITE (HIT)
[Diagram: the client writes to the cache tier and receives the ack; the cached object is now dirty]
43
WRITE (MISS)
[Diagram: the object is first promoted from the base tier into the cache, then the write is applied and acked]
44
WRITE (MISS) (COMING SOON)
[Diagram: instead of promoting, the write is proxied through to the base tier and acked]
45
ESTIMATING TEMPERATURE
● Each PG constructs in-memory bloom filters
– Insert records on both read and write
– Each filter covers configurable period (e.g., 1 hour)
– Tunable false positive probability (e.g., 5%)
– Store most recent N periods on disk (e.g., last 24 hours)
● Estimate temperature
– Has object been accessed in any of the last N periods?
– ...in how many of them?
– Informs the flush/evict decision
● Estimate “recency”
– How many periods have passed since the object was last accessed?
– Informs read miss behavior: proxy vs promote
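A toy model of the same bookkeeping, using plain Python sets where the OSD actually uses per-period bloom filters to bound memory; the period length and retention are illustrative.

    from collections import deque

    class HitSetHistory(object):
        """Toy per-PG hit-set history: one set per time period, newest first.
        Real OSDs use bloom filters per period; plain sets keep the sketch short."""
        def __init__(self, periods=24):
            self.history = deque(maxlen=periods)    # most recent N periods
            self.current = set()

        def record_access(self, obj):               # called on reads *and* writes
            self.current.add(obj)

        def rotate(self):                           # end of a period (e.g., one hour)
            self.history.appendleft(self.current)
            self.current = set()

        def temperature(self, obj):
            """In how many of the retained periods was the object accessed?"""
            return sum(1 for hs in self.history if obj in hs)

        def recency(self, obj):
            """How many periods ago was the most recent access? (proxy vs promote)"""
            for age, hs in enumerate(self.history):
                if obj in hs:
                    return age
            return len(self.history)                # not seen in any retained period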
46
FLUSH AND/OR EVICT COLD DATA
[Diagram: cold objects are flushed from the cache tier back to the base tier and then evicted]
47
TIERING AGENT
● Each PG has an internal tiering agent
– Manages PG based on administrator defined policy
● Flush dirty objects
– When pool reaches target dirty ratio
– Tries to select cold objects
– Marks objects clean when they have been written back
to the base pool
● Evict (delete) clean objects
– Greater “effort” as cache pool approaches target size
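A sketch of that policy as Python; the pg object and its methods (dirty_ratio, temperature, flush_to_base, evict, and so on) are hypothetical stand-ins for the OSD's internal interfaces, not a real API.

    def agent_pass(pg, target_dirty_ratio=0.4, target_full_ratio=0.8):
        """Illustrative tiering-agent pass for one cache-pool PG.
        All methods on `pg` are hypothetical stand-ins for OSD internals."""
        # Flush: copy dirty objects back to the base pool, coldest first,
        # until the pool is back under its target dirty ratio.
        if pg.dirty_ratio() > target_dirty_ratio:
            for obj in sorted(pg.dirty_objects(), key=pg.temperature):
                pg.flush_to_base(obj)               # write back, then mark clean
                if pg.dirty_ratio() <= target_dirty_ratio:
                    break
        # Evict: delete clean cached copies, trying harder as the cache
        # pool approaches its target size.
        if pg.full_ratio() > target_full_ratio:
            for obj in sorted(pg.clean_objects(), key=pg.temperature):
                pg.evict(obj)                       # base pool still holds the data
                if pg.full_ratio() <= target_full_ratio:
                    break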
48
CACHE TIER USAGE
● Cache tier should be faster than the base tier
● Cache tier should be replicated (not erasure coded)
● Promote and flush are expensive
– Best results when object temperatures are skewed
● Most I/O goes to small number of hot objects
– Cache should be big enough to capture most of the working set
● Challenging to benchmark
– Need a realistic workload (e.g., not 'dd') to determine
how it will perform in practice
– Takes a long time to “warm up” the cache
49
ERASURE CODING
ERASURE CODING
[Diagram: replication stores full copies of an object on several OSDs; erasure coding splits it into data shards (1-4) plus coding shards (X, Y), one per OSD]
53
ERASURE CODING SHARDS
Shard (one per OSD):   1    2    3    4    X    Y
Stripe 0:              0    1    2    3    A    A'
Stripe 1:              4    5    6    7    B    B'
Stripe 2:              8    9    10   11   C    C'
Stripe 3:              12   13   14   15   D    D'
Stripe 4:              16   17   18   19   E    E'
● Variable stripe size (e.g., 4 KB)
● Zero-fill shards (logically) in partial tail stripe
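To make the layout concrete, here is a simplified sketch that splits an object into k data shards of fixed stripe-unit size, zero-fills the partial tail stripe, and computes a single XOR parity shard; real plugins (jerasure, ISA-L, and friends) produce m Reed-Solomon-style coding shards instead.

    def encode_shards(data, k=4, stripe_unit=4096):
        """Split `data` into k data shards plus one XOR parity shard.
        Illustration only: real EC plugins generate m coding shards with
        Reed-Solomon or similar codes, not a single XOR parity."""
        stripe = k * stripe_unit
        if len(data) % stripe:                          # zero-fill the partial tail stripe
            data += b'\x00' * (stripe - len(data) % stripe)
        shards = [bytearray() for _ in range(k + 1)]    # k data shards + 1 parity shard
        for off in range(0, len(data), stripe):
            parity = bytearray(stripe_unit)
            for i in range(k):
                unit = data[off + i * stripe_unit: off + (i + 1) * stripe_unit]
                shards[i] += unit
                for j, b in enumerate(unit):
                    parity[j] ^= b
            shards[k] += parity
        return shards                                   # shards[0..k-1] data, shards[k] parity

    shards = encode_shards(b'hello, erasure coded world ' * 1000)
    # Any one lost data shard can be rebuilt by XORing the k surviving shards.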
54
PRIMARY COORDINATES
[Diagram: one OSD acts as the PG's primary and coordinates I/O across the shards (1-4, X, Y)]
55
EC READ
[Diagram: the client sends a READ for the object to the primary OSD of the erasure-coded PG]
56
EC READ
[Diagram: the primary reads the shards it needs from the other OSDs]
57
EC READ
[Diagram: the primary reassembles the object and returns the READ REPLY to the client]
58
EC WRITE
[Diagram: the client sends a WRITE for the object to the primary OSD]
59
EC WRITE
[Diagram: the primary encodes the data and distributes the shard writes (1-4, X, Y) to the other OSDs]
60
EC WRITE
[Diagram: once the shards are written, the primary sends the WRITE ACK to the client]
61
EC WRITE: DEGRADED
[Diagram: the write proceeds even though some shard OSDs are down (degraded)]
62
EC WRITE: PARTIAL FAILURE
[Diagram: the primary fails partway through distributing the shard writes]
63
EC WRITE: PARTIAL FAILURE
[Diagram: the erasure coded pool is left with shards at mixed versions: some at the new version (B), some still at the old version (A)]
64
EC RESTRICTIONS
● Overwrite in place will not work in general
● Log and 2PC would increase complexity, latency
● We chose to restrict allowed operations
– create
– append (on stripe boundary)
– remove (keep previous generation of object for some time)
● These operations can all easily be rolled back locally
– create → delete
– append → truncate
– remove → roll back to previous generation
● Object attrs preserved in existing PG logs (they are small)
● Key/value data is not allowed on EC pools
65
EC WRITE: PARTIAL FAILURE
[Diagram: after the failure, the surviving shards of the erasure coded pool are inconsistent (a mix of versions B and A)]
66
EC WRITE: PARTIAL FAILURE
[Diagram: the partial write is rolled back, leaving every shard at the consistent previous version (A)]
67
EC RESTRICTIONS
● This is a small subset of allowed librados operations
– Notably cannot (over)write any extent
● Coincidentally, unsupported operations are also
inefficient for erasure codes
– Generally require read/modify/write of affected stripe(s)
● Some can consume EC directly
– RGW (no object data update in place)
● Others can combine EC with a cache tier (RBD,
CephFS)
– Replication for warm/hot data
– Erasure coding for cold data
– Tiering agent skips objects with key/value data
68
WHICH ERASURE CODE?
● The EC algorithm and implementation are pluggable
– jerasure/gf-complete (free, open, and very fast)
– ISA-L (Intel library; optimized for modern Intel procs)
– LRC (local recovery code – layers over existing plugins)
– SHEC (trades extra storage for recovery efficiency – new from Fujitsu)
● Parameterized
– Pick “k” and “m”, stripe size
● OSD handles data path, placement, rollback, etc.
● Erasure plugin handles
– Encode and decode math
– Given these available shards, which ones should I fetch to satisfy a
read?
– Given these available shards and these missing shards, which ones
should I fetch to recover?
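One simple way to compare profiles is raw-space overhead, (k+m)/k, against the number of failures tolerated, m; the sketch below just prints that arithmetic for a few common choices.

    def overhead(k, m):
        """Raw bytes stored per byte of user data for a k+m erasure-code profile."""
        return float(k + m) / k

    for k, m in [(2, 1), (4, 2), (8, 3), (10, 4)]:
        print('k=%d m=%d: %.2fx raw space, tolerates %d failures'
              % (k, m, overhead(k, m), m))
    # For comparison, 3x replication uses 3.00x raw space and also tolerates 2 failures.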
69
COST OF RECOVERY
[Diagram: an OSD holding 1 TB of data]
70
COST OF RECOVERY
[Diagram: the 1 TB OSD fails]
71
COST OF RECOVERY (REPLICATION)
[Diagram: with replication, the lost 1 TB is re-read from surviving replicas]
72
COST OF RECOVERY (REPLICATION)
[Diagram: declustered placement spreads the work: many OSDs each recover ~0.01 TB in parallel]
73
COST OF RECOVERY (REPLICATION)
[Diagram: in total, about 1 TB of data is read and re-written to restore the lost replicas]
74
COST OF RECOVERY (EC)
[Diagram: with a k=4 erasure code, rebuilding the lost 1 TB of shards requires reading about 1 TB from each of the k surviving shards, roughly 4 TB in total]
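The same comparison as back-of-the-envelope arithmetic: under replication, rebuilding a failed OSD's 1 TB needs roughly 1 TB of reads from surviving copies, while under a k+m code each lost shard is reconstructed from k surviving shards, so roughly k times as much must be read.

    def recovery_reads_tb(lost_tb, k=None):
        """Approximate data read to rebuild a failed OSD holding `lost_tb`.
        Replication (k=None): read one surviving copy of everything.
        k+m erasure code: each lost shard is rebuilt from k surviving shards."""
        return lost_tb if k is None else k * lost_tb

    print(recovery_reads_tb(1.0))          # replication: ~1 TB read
    print(recovery_reads_tb(1.0, k=4))     # 4+2 erasure code: ~4 TB read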
75
LOCAL RECOVERY CODE (LRC)
[Diagram: LRC adds local parity shards (A, B, C) computed over subsets of the base code's shards (1-4, X, Y), so a single lost shard can be rebuilt from its small local group instead of k shards]
76
BIG THANKS TO
● Ceph
– Loic Dachary (CloudWatt, FSF France, Red Hat)
– Andreas Peters (CERN)
– Sam Just (Inktank / Red Hat)
– David Zafman (Inktank / Red Hat)
● jerasure / gf-complete
– Jim Plank (University of Tennessee)
– Kevin Greenan (Box.com)
● Intel (ISA-L plugin)
● Fujitsu (SHEC plugin)
77
ROADMAP
WHAT'S NEXT
● Erasure coding
– Allow (optimistic) client reads directly from shards
– ARM optimizations for jerasure
● Cache pools
– Better agent decisions (when to flush or evict)
– Supporting different performance profiles
● e.g., slow / “cheap” flash can read just as fast
– Complex topologies
● Multiple readonly cache tiers in multiple sites
● Tiering
– Support “redirects” to (very) cold tier below base pool
– Enable dynamic spin-down, dedup, and other features
79
OTHER ONGOING WORK
● Performance optimization (SanDisk, Intel, Mellanox)
● Alternative OSD backends
– New backend: hybrid key/value and file system
– leveldb, rocksdb, LMDB
● Messenger (network layer) improvements
– RDMA support (libxio – Mellanox)
– Event-driven TCP implementation (UnitedStack)
● CephFS
– Online consistency checking and repair tools
– Performance, robustness
● Multi-datacenter RBD, RADOS replication
80
FOR MORE INFORMATION
● http://ceph.com
● http://github.com/ceph
● http://tracker.ceph.com
● Mailing lists
– ceph-users@ceph.com
– ceph-devel@vger.kernel.org
● irc.oftc.net
– #ceph
– #ceph-devel
● Twitter
– @ceph
81
THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
sage@redhat.com
@liewegas