Storage Tiering and Erasure Coding in Ceph - 150222

The document discusses erasure coding and cache tiering in Ceph. It provides an overview of the Ceph architecture, including RADOS, the distributed object store, and higher-level services such as RBD for block storage and CephFS for a distributed file system. It describes how cache tiering can improve performance and how erasure coding can provide durability at lower cost.


ERASURE CODING AND CACHE TIERING

SAGE WEIL – SCALE13X - 2015.02.22


AGENDA
● Ceph architectural overview
● RADOS background
● Cache tiering
● Erasure coding
● Project status, roadmap

2
ARCHITECTURE
CEPH MOTIVATING PRINCIPLES
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage whenever possible
● Open source (LGPL)

● Move beyond legacy approaches
  – Client/cluster instead of client/server
  – Ad hoc HA

4
CEPH COMPONENTS

APP → RGW | HOST/VM → RBD | CLIENT → CEPHFS

RGW: a web services gateway for object storage, compatible with S3 and Swift
RBD: a reliable, fully-distributed block device with cloud platform integration
CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management

LIBRADOS
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS
A software-based, reliable, autonomous, distributed object store comprised of
self-healing, self-managing, intelligent storage nodes and lightweight monitors

5
ROBUST SERVICES BUILT ON RADOS
ARCHITECTURAL COMPONENTS

(architecture diagram repeated from slide 5: RGW, RBD, and CEPHFS on LIBRADOS on RADOS)

7
THE RADOS GATEWAY

(diagram: applications speak REST to RADOSGW instances, which use LIBRADOS over a socket to reach the RADOS cluster and its monitors)

8
MULTI-SITE OBJECT STORAGE

(diagram: web applications talk to app servers, each backed by a Ceph Object Gateway (RGW) in front of its own Ceph storage cluster, e.g. US-EAST and EU-WEST)

9
RADOSGW MAKES RADOS WEBBY

RADOSGW:
● REST-based object storage proxy
● Uses RADOS to store objects
● Stripes large RESTful objects across many RADOS objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications

11
ARCHITECTURAL COMPONENTS

(architecture diagram repeated from slide 5: RGW, RBD, and CEPHFS on LIBRADOS on RADOS)

12
STORING VIRTUAL DISKS

(diagram: a VM's hypervisor uses LIBRBD to store the virtual disk in the RADOS cluster)

13
KERNEL MODULE

(diagram: a Linux host maps an RBD image directly through the KRBD kernel module, talking to the RADOS cluster)

14
RBD FEATURES
● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration
– Qemu
– Linux kernel
– iSCSI (STGT, LIO)
– OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox
● Incremental backup (relative to snapshots)

15
ARCHITECTURAL COMPONENTS

(architecture diagram repeated from slide 5: RGW, RBD, and CEPHFS on LIBRADOS on RADOS)

16
SEPARATE METADATA SERVER

(diagram: the CephFS kernel client sends metadata operations to the metadata server and reads/writes file data directly in the RADOS cluster)

17
SCALABLE METADATA SERVERS

METADATA SERVER
● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Snapshots on any directory
● Clients stripe file data in RADOS
  – MDS not in data path
● MDS stores metadata in RADOS
● Dynamic MDS cluster scales to 10s or 100s
● Only required for shared filesystem
18
RADOS
ARCHITECTURAL COMPONENTS

(architecture diagram repeated from slide 5: RGW, RBD, and CEPHFS on LIBRADOS on RADOS)

20
RADOS
● Flat object namespace within each pool
● Rich object API (librados)
– Bytes, attributes, key/value data
– Partial overwrite of existing data (mutable objects)
– Single-object compound operations
– RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path

21
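The librados bindings listed above (C, C++, Java, Python, Ruby, PHP) expose this object API directly. Below is a minimal sketch using the Python rados module; the config file path, pool name, object name, and attribute are assumptions chosen for illustration, not part of the original slides.

```python
import rados

# Connect to the cluster using a standard config path (assumed location).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context on an existing pool (pool name is hypothetical).
ioctx = cluster.open_ioctx('rbd')
try:
    ioctx.write_full('demo-object', b'hello rados')    # object bytes
    ioctx.set_xattr('demo-object', 'owner', b'demo')   # object attribute
    print(ioctx.read('demo-object'))                   # read the data back
finally:
    ioctx.close()
    cluster.shutdown()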
RADOS CLUSTER

(diagram: an application talks directly to a RADOS cluster made up of OSDs and monitors (M))

22
RADOS COMPONENTS

OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of data path
23
OBJECT STORAGE DAEMONS

(diagram: each OSD runs on top of a local filesystem (xfs, btrfs, or ext4) on its own disk; monitors run alongside the OSDs)

24
DATA PLACEMENT
WHERE DO OBJECTS LIVE?

(diagram: an application holds an object; which node in the cluster should store it?)

26
A METADATA SERVER?

(diagram: option 1 - ask a central metadata server where the object lives (step 1), then access it (step 2))

27
CALCULATED PLACEMENT

(diagram: option 2 - the application computes the location itself, e.g. object "F" falls in the A-G range served by one node, with other nodes covering H-N, O-T, and U-Z)

28
CRUSH

(diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster)

29
CRUSH IS A QUICK CALCULATION

(diagram: the client computes an object's location in the RADOS cluster directly; there is no lookup)

30
CRUSH AVOIDS FAILED DEVICES

(diagram: when an OSD fails, CRUSH maps the object's PG onto a different, healthy OSD)

31
CRUSH: DECLUSTERED PLACEMENT

● Each PG independently maps to a pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD
  – Highly parallel recovery
  – Avoids the single-disk recovery bottleneck

32
CRUSH: DYNAMIC DATA PLACEMENT

CRUSH:
● Pseudo-random placement algorithm
  – Fast calculation, no lookup
  – Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
  – Limited data migration on change
● Rule-based configuration
  – Infrastructure topology aware
  – Adjustable replication
  – Weighted devices (different sizes)
33
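These properties follow from CRUSH being a pure function of the object name, the pool parameters, and the cluster map. The sketch below is not CRUSH (it is closer to rendezvous hashing and ignores weights, rules, and failure domains), but it illustrates the core idea of a fast, repeatable calculation instead of a lookup; all names and parameters are illustrative.

```python
import hashlib

def toy_place(obj_name: str, pg_count: int, osds: list[str], size: int = 3) -> list[str]:
    """Toy stand-in for CRUSH: hash the object to a PG, then rank OSDs by a
    hash of (PG, OSD) and keep the first `size`. Deterministic and lookup-free,
    but without CRUSH's weights, rules, or topology awareness."""
    pg = int(hashlib.sha1(obj_name.encode()).hexdigest(), 16) % pg_count
    ranked = sorted(osds, key=lambda osd: hashlib.sha1(f"{pg}:{osd}".encode()).hexdigest())
    return ranked[:size]

osds = [f"osd.{i}" for i in range(12)]
print(toy_place("rbd_data.1234.0000", pg_count=256, osds=osds))
# The same inputs always yield the same OSD set; removing one OSD only
# disturbs the PGs that actually mapped to it.
```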
DATA IS ORGANIZED INTO POOLS

(diagram: objects are written into separate pools A, B, C, and D; each pool has its own placement groups, which all map onto the shared cluster of OSDs)

34
TIERED STORAGE
TWO WAYS TO CACHE

● Within each OSD
  – Combine SSD and HDD under each OSD
  – Make localized promote/demote decisions
  – Leverage existing tools
    ● dm-cache, bcache, FlashCache
    ● Variety of caching controllers
  – We can help with hints
● Cache on separate devices/nodes
  – Different hardware for different tiers
    ● Slow nodes for cold data
    ● High performance nodes for hot data
  – Add, remove, scale each tier independently
    ● Unlikely to choose right ratios at procurement time

36
TIERED STORAGE

(diagram: the application talks to a replicated cache pool layered over an erasure coded backing pool, both inside the same Ceph storage cluster)

37
RADOS TIERING PRINCIPLES
● Each tier is a RADOS pool
– May be replicated or erasure coded
● Tiers are durable
– e.g., replicate across SSDs in multiple hosts
● Each tier has its own CRUSH policy
– e.g., map the cache pool to SSD devices/hosts only
● librados adapts to the tiering topology
  – Transparently directs requests accordingly (e.g., to the cache)
  – No changes to RBD, RGW, CephFS, etc.

38
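As a concrete illustration of these principles, a cache tier is attached to a base pool with a few ceph osd tier commands, wrapped in Python here only to keep the sketch runnable. The pool names are assumptions (an existing erasure coded base pool and a replicated SSD pool), and a real deployment would still need to configure hit sets and target ratios.

```python
import subprocess

def ceph(*args: str) -> None:
    """Run one ceph CLI command; assumes admin keyring access on this host."""
    subprocess.run(["ceph", *args], check=True)

# 'cold-ec' and 'hot-ssd' are hypothetical pool names: an erasure coded base
# pool and a replicated pool whose CRUSH rule targets SSD hosts.
ceph("osd", "tier", "add", "cold-ec", "hot-ssd")            # attach the cache tier
ceph("osd", "tier", "cache-mode", "hot-ssd", "writeback")   # writeback caching
ceph("osd", "tier", "set-overlay", "cold-ec", "hot-ssd")    # redirect client I/O
```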
READ (CACHE HIT)

(diagram: the client's READ is answered directly by the writeback cache pool (SSD); the backing pool (HDD) is not touched)

39
READ (CACHE MISS)

(diagram: on a miss, the cache pool proxies the READ to the backing pool (HDD) and returns the reply to the client)

40
READ (CACHE MISS)

(diagram: alternatively, the object is promoted from the backing pool into the cache pool, which then serves the READ)

42
WRITE (HIT)

(diagram: the client's WRITE is applied and acknowledged by the writeback cache pool; the backing pool is updated later by a flush)

43
WRITE (MISS)

(diagram: on a write miss, the object is first promoted into the cache pool, then the WRITE is applied and acknowledged)

44
WRITE (MISS) (COMING SOON)

(diagram: alternatively, the cache pool proxies the WRITE straight to the backing pool and acknowledges the client, without promoting the object)

45
ESTIMATING TEMPERATURE
● Each PG constructs in-memory bloom filters
– Insert records on both read and write
– Each filter covers configurable period (e.g., 1 hour)
– Tunable false positive probability (e.g., 5%)
– Store most recent N periods on disk (e.g., last 24 hours)
● Estimate temperature
– Has object been accessed in any of the last N periods?
– ...in how many of them?
– Informs the flush/evict decision
● Estimate “recency”
– How many periods have passed since the object was last accessed?
– Informs read miss behavior: proxy vs promote
46
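A rough sketch of the idea (not Ceph's actual HitSet implementation): one small bloom filter per time period, object names inserted on access, and temperature/recency answered by probing the most recent N filters. Filter size, hash count, and period count are illustrative.

```python
import hashlib
from collections import deque

class BloomFilter:
    """Tiny bloom filter; the false-positive rate depends on bits and hashes."""
    def __init__(self, bits: int = 8192, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item: str):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))

class TemperatureEstimator:
    """Most recent period first; rotate() starts a new period (e.g., hourly)."""
    def __init__(self, periods: int = 24):
        self.filters = deque([BloomFilter()], maxlen=periods)

    def record_access(self, obj: str) -> None:
        self.filters[0].add(obj)           # insert on both read and write

    def rotate(self) -> None:
        self.filters.appendleft(BloomFilter())   # oldest period falls off the end

    def temperature(self, obj: str) -> int:
        return sum(obj in f for f in self.filters)   # hit count over the last N periods

    def recency(self, obj: str) -> int:
        for age, f in enumerate(self.filters):
            if obj in f:
                return age                  # periods since the object was last seen
        return len(self.filters)
```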
FLUSH AND/OR EVICT COLD DATA

(diagram: the cache pool flushes dirty cold objects down to the backing pool (HDD) and then evicts the clean copies, independently of client I/O)

47
TIERING AGENT
● Each PG has an internal tiering agent
– Manages the PG based on administrator-defined policy
● Flush dirty objects
– When pool reaches target dirty ratio
– Tries to select cold objects
– Marks objects clean when they have been written back
to the base pool
● Evict (delete) clean objects
– Greater “effort” as cache pool approaches target size

48
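A highly simplified sketch of one pass of such an agent, under assumed thresholds. The real agent works per PG, uses the temperature and recency estimates from the previous slide, and ramps its effort gradually; the ratios and the "flush half the coldest dirty objects" rule here are only illustrative.

```python
def agent_pass(objects, cache_used_ratio,
               dirty_ratio_target=0.4, full_ratio_target=0.8):
    """One sketchy agent pass.

    objects: dicts with 'name', 'dirty' (bool), 'temperature' (hit count).
    cache_used_ratio: fraction of the cache pool's target size currently used.
    Returns (flush, evict) lists of object names."""
    dirty = sorted((o for o in objects if o["dirty"]), key=lambda o: o["temperature"])
    clean = sorted((o for o in objects if not o["dirty"]), key=lambda o: o["temperature"])

    flush, evict = [], []

    # Flush the coldest dirty objects once the dirty ratio exceeds its target.
    if objects and len(dirty) / len(objects) > dirty_ratio_target:
        flush = [o["name"] for o in dirty[: len(dirty) // 2]]

    # Evict the coldest clean objects with increasing "effort" as the pool fills.
    if cache_used_ratio > full_ratio_target:
        effort = min(1.0, (cache_used_ratio - full_ratio_target) / (1 - full_ratio_target))
        evict = [o["name"] for o in clean[: int(len(clean) * effort)]]

    return flush, evict
```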
CACHE TIER USAGE
● Cache tier should be faster than the base tier
● Cache tier should be replicated (not erasure coded)
● Promote and flush are expensive
– Best results when object temperatures are skewed
● Most I/O goes to small number of hot objects
– Cache should be big enough to capture most of the
acting set
● Challenging to benchmark
– Need a realistic workload (e.g., not 'dd') to determine
how it will perform in practice
– Takes a long time to “warm up” the cache

49
ERASURE CODING
ERASURE CODING

REPLICATED POOL: full copies of stored objects
● Very high durability
● 3x (200% overhead)
● Quicker recovery

ERASURE CODED POOL: one copy plus parity
● Cost-effective durability
● 1.5x (50% overhead)
● Expensive recovery

52
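The space overheads quoted above fall straight out of the pool parameters: 3 replicas store 3x the data, while k=4 data shards plus m=2 coding shards store (4+2)/4 = 1.5x. A quick check:

```python
def replication_overhead(copies: int) -> float:
    return float(copies)            # raw bytes stored per logical byte

def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k              # k data shards plus m coding shards

print(replication_overhead(3))      # 3.0  -> 200% extra space
print(ec_overhead(4, 2))            # 1.5  ->  50% extra space
```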
ERASURE CODING SHARDS

(diagram: an object is split into data shards 1-4 plus coding shards X and Y, each stored on a different OSD in the erasure coded pool)

53
ERASURE CODING SHARDS

Shard:     1    2    3    4    X    Y
Stripe 0:  0    1    2    3    A    A'
Stripe 1:  4    5    6    7    B    B'
Stripe 2:  8    9    10   11   C    C'
Stripe 3:  12   13   14   15   D    D'
Stripe 4:  16   17   18   19   E    E'

● Variable stripe size (e.g., 4 KB)
● Zero-fill shards (logically) in partial tail stripe

54
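A sketch of the striping layout shown above: the object is cut into stripes of k chunks, the partial tail stripe is logically zero-filled, and each stripe gets coding chunks. The single XOR "parity" chunk here is only a stand-in for the real jerasure/ISA-L math, and the chunk size and k are illustrative.

```python
from functools import reduce
from operator import xor

def stripe_object(data: bytes, k: int = 4, chunk_size: int = 4096):
    """Split `data` into stripes of k chunks (zero-filling the tail stripe)
    and append one XOR chunk per stripe in place of the m coding shards."""
    stripe_size = k * chunk_size
    stripes = []
    for off in range(0, len(data), stripe_size):
        block = data[off:off + stripe_size].ljust(stripe_size, b"\0")
        chunks = [block[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]
        parity = bytes(reduce(xor, col) for col in zip(*chunks))
        stripes.append(chunks + [parity])
    return stripes

stripes = stripe_object(b"x" * 20000)   # one full stripe plus a zero-padded tail stripe
print(len(stripes), [len(c) for c in stripes[0]])   # 2 stripes, five 4096-byte chunks each
```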
PRIMARY COORDINATES

(diagram: one of the six OSDs holding shards 1-4, X, Y acts as the primary and coordinates the others)

55
EC READ

(diagram: the client sends a READ to the primary OSD of the erasure coded PG)

56
EC READ

(diagram: the primary issues sub-reads to enough shard OSDs to reconstruct the object)

57
EC READ

(diagram: the primary assembles the data and sends the REPLY back to the client)

58
EC WRITE

(diagram: the client sends a WRITE to the primary OSD)

59
EC WRITE

(diagram: the primary encodes the stripe and issues sub-writes to all shard OSDs)

60
EC WRITE

(diagram: once the shard OSDs have written their pieces, the primary ACKs the client)

61
EC WRITE: DEGRADED

(diagram: with one shard OSD down, the primary still sends sub-writes to the remaining shards)

62
EC WRITE: PARTIAL FAILURE

(diagram: the write fails partway through, after only some shard OSDs have applied their sub-writes)

63
EC WRITE: PARTIAL FAILURE

(diagram: some shards are now at version B while others are still at version A, leaving an inconsistent stripe)

64
EC RESTRICTIONS
● Overwrite in place will not work in general
● Log and 2PC would increase complexity, latency
● We chose to restrict allowed operations
– create
– append (on stripe boundary)
– remove (keep previous generation of object for some time)
● These operations can all easily be rolled back locally
– create → delete
– append → truncate
– remove → roll back to previous generation
● Object attrs preserved in existing PG logs (they are small)
● Key/value data is not allowed on EC pools
65
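The point of restricting the operation set is that each allowed operation has a cheap local inverse a shard can apply from its PG log. A toy illustration, with an in-memory dict standing in for a shard's object store (not the OSD's actual machinery):

```python
def apply_with_rollback(store: dict, op: str, name: str, data: bytes = b""):
    """Apply an allowed EC-pool operation and return a closure that undoes it
    locally, mirroring the create/append/remove rollbacks listed above."""
    if op == "create":
        store[name] = b""
        return lambda: store.pop(name, None)                            # create -> delete
    if op == "append":
        old_len = len(store[name])
        store[name] += data
        return lambda: store.__setitem__(name, store[name][:old_len])   # append -> truncate
    if op == "remove":
        previous = store.pop(name)                                      # keep prior generation
        return lambda: store.__setitem__(name, previous)                # remove -> restore
    raise ValueError(f"{op!r} is not allowed on an EC pool")

shard = {}
undo_create = apply_with_rollback(shard, "create", "obj")
undo_append = apply_with_rollback(shard, "append", "obj", b"stripe-aligned data")
undo_append()   # roll the uncommitted append back; 'obj' is empty again
```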
EC WRITE: PARTIAL FAILURE

(diagram: the PG is left with shards at mixed versions B and A, as recorded in the shard PG logs)

66
EC WRITE: PARTIAL FAILURE

(diagram: the partially applied write is rolled back locally, returning every shard to version A)

67
EC RESTRICTIONS
● This is a small subset of allowed librados operations
– Notably cannot (over)write any extent
● Coincidentally, unsupported operations are also
inefficient for erasure codes
– Generally require read/modify/write of affected stripe(s)
● Some can consume EC directly
– RGW (no object data update in place)
● Others can combine EC with a cache tier (RBD,
CephFS)
– Replication for warm/hot data
– Erasure coding for cold data
– Tiering agent skips objects with key/value data
68
WHICH ERASURE CODE?
● The EC algorithm and implementation are pluggable
– jerasure/gf-complete (free, open, and very fast)
– ISA-L (Intel library; optimized for modern Intel procs)
– LRC (local recovery code – layers over existing plugins)
– SHEC (trades extra storage for recovery efficiency – new from Fujitsu)
● Parameterized
– Pick “k” and “m”, stripe size
● OSD handles data path, placement, rollback, etc.
● Erasure plugin handles
– Encode and decode math
– Given these available shards, which ones should I fetch to satisfy a
read?
– Given these available shards and these missing shards, which ones
should I fetch to recover?
69
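A rough Python mirror of the division of labour described above; the real plugin interface is C++ (implemented by jerasure, ISA-L, LRC, and SHEC), so the class and method names below are only indicative, not the actual API.

```python
from abc import ABC, abstractmethod

class ErasureCodePlugin(ABC):
    """Sketch of what a plugin owes the OSD: the coding math plus advice on
    which shards to fetch. Placement, rollback, and the data path stay in the OSD."""

    def __init__(self, k: int, m: int, stripe_unit: int = 4096):
        self.k, self.m, self.stripe_unit = k, m, stripe_unit

    @abstractmethod
    def encode(self, stripe: bytes) -> list[bytes]:
        """Turn one stripe into k data shards plus m coding shards."""

    @abstractmethod
    def decode(self, shards: dict[int, bytes]) -> bytes:
        """Rebuild the stripe from a sufficient subset of shards."""

    @abstractmethod
    def minimum_to_decode(self, want: set[int], available: set[int]) -> set[int]:
        """Given the shards that are available, pick the cheapest set to fetch
        to satisfy a read or to recover the shards in `want`."""
```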
COST OF RECOVERY

(diagram: a 1 TB OSD)

70
COST OF RECOVERY

(diagram: the 1 TB OSD fails; its contents must be recovered elsewhere)

71
COST OF RECOVERY (REPLICATION)

(diagram: with replication, recovering the failed OSD means reading 1 TB from surviving replicas)

72
COST OF RECOVERY (REPLICATION)

(diagram: recovery is spread across many OSDs, each contributing a small piece, e.g. .01 TB apiece)

73
COST OF RECOVERY (REPLICATION)

(diagram: in total, roughly 1 TB is read to restore the lost 1 TB of data)

74
COST OF RECOVERY (EC)

(diagram: with erasure coding, rebuilding the lost shards requires reading roughly k times as much, e.g. 1 TB from each of four surviving shard OSDs)

75
LOCAL RECOVERY CODE (LRC)

(diagram: alongside data shards 1-4 and coding shards X, Y, LRC stores additional local parity shards A, B, C so that a single lost shard can be rebuilt from a small local group rather than from k shards)

76
BIG THANKS TO
● Ceph
– Loic Dachary (CloudWatt, FSF France, Red Hat)
– Andreas Peters (CERN)
– Sam Just (Inktank / Red Hat)
– David Zafman (Inktank / Red Hat)
● jerasure / gf-complete
– Jim Plank (University of Tennessee)
– Kevin Greenan (Box.com)
● Intel (ISA-L plugin)
● Fujitsu (SHEC plugin)

77
ROADMAP
WHAT'S NEXT
● Erasure coding
– Allow (optimistic) client reads directly from shards
– ARM optimizations for jerasure
● Cache pools
– Better agent decisions (when to flush or evict)
– Supporting different performance profiles
● e.g., slow / “cheap” flash can read just as fast
– Complex topologies
● Multiple readonly cache tiers in multiple sites
● Tiering
– Support “redirects” to (very) cold tier below base pool
– Enable dynamic spin-down, dedup, and other features
79
OTHER ONGOING WORK
● Performance optimization (SanDisk, Intel, Mellanox)
● Alternative OSD backends
– New backend: hybrid key/value and file system
– leveldb, rocksdb, LMDB
● Messenger (network layer) improvements
– RDMA support (libxio – Mellanox)
– Event-driven TCP implementation (UnitedStack)
● CephFS
– Online consistency checking and repair tools
– Performance, robustness
● Multi-datacenter RBD, RADOS replication
80
FOR MORE INFORMATION
● http://ceph.com
● http://github.com/ceph
● http://tracker.ceph.com
● Mailing lists
– ceph-users@ceph.com
– ceph-devel@vger.kernel.org
● irc.oftc.net
– #ceph
– #ceph-devel
● Twitter
– @ceph
81
THANK YOU!

Sage Weil
CEPH PRINCIPAL ARCHITECT

sage@redhat.com

@liewegas
