SQL Azure
SQL Azure Database as a Service
• On-demand provisioning of SQL databases
• Familiar relational programming model
– Leverage existing skills and tools
• SLA for availability and performance
• Pay-as-you-go pricing model
• Full control over logical database administration
– No physical database administration headaches
• Large geo-presence
– 3 regions (US, Europe, Asia), each with 2 sub-regions
Challenges And Our Approach
• Challenges
– Scale – storage, processing, and delivery
– Consistency – transactions, replication, failures, HA
– Manageability – deployment and self-management
• Our approach
– SQL Server technology as node storage
– Distributed fabric for self-healing and scale
– Automated deployment and provisioning (low OpEx)
– Commodity hardware for reduced CapEx
– Software to achieve required reliability
SQL Azure model
ARCHITECTURE
Network Topology
• Applications use standard SQL
• Application client libraries: ODBC, ADO.Net, PHP, JDBC, …
[Figure: applications on the Internet connect over TDS (tcp) into the Azure Cloud; at the security boundary, a load balancer forwards 'sticky' sessions to the TDS protocol tier; the Gateway, a TDS protocol gateway, enforces AUTHN/AUTHZ policy and proxies over TDS (tcp) to the SQL tier]
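Because the service speaks standard TDS end to end, a client connects exactly as it would to an on-premises SQL Server. A minimal sketch using pyodbc; the server name, database, and credentials are placeholders, not values from the deck:

    import pyodbc  # any TDS-speaking driver works; pyodbc is one example

    # Placeholders throughout: substitute your own server, database, and
    # credentials. The gateway terminates TDS and proxies to the SQL tier.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;"
        "DATABASE=mydb;UID=myuser;PWD=mypassword;Encrypt=yes"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT @@VERSION")
    print(cursor.fetchone()[0])
    conn.close()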
HIGH AVAILABILITY
Scalability and availability: fabric, failover, replication, and load balancing
Concepts
• Storage Unit: supports CRUD operations (e.g. a DB row)
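As an illustration only (the deck names the concept but shows no interface), a storage unit is anything exposing CRUD; a minimal in-memory sketch:

    # Illustrative only: the deck defines a storage unit as anything
    # supporting CRUD (e.g. a database row); this interface is invented.
    from typing import Any, Dict, Optional

    class StorageUnit:
        def __init__(self) -> None:
            self._rows: Dict[Any, Dict[str, Any]] = {}

        def create(self, key: Any, value: Dict[str, Any]) -> None:
            self._rows[key] = value

        def read(self, key: Any) -> Optional[Dict[str, Any]]:
            return self._rows.get(key)

        def update(self, key: Any, value: Dict[str, Any]) -> None:
            self._rows[key].update(value)  # KeyError if absent, by design

        def delete(self, key: Any) -> None:
            self._rows.pop(key, None)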
Replication
• All reads are completed at the primary
• Writes are replicated to a write quorum of replicas
• Commit on secondaries first, then on the primary
• Each transaction has a commit sequence number (epoch, num)
[Figure: a client reads a value from the primary (P); the primary forwards each write to the secondaries (S), which ack back, and the primary acks the client once the quorum has written]
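A toy model of this write path, under stated assumptions (in-memory replicas, synchronous acks, and a majority write quorum; the deck does not define the quorum size):

    # Toy model of the write path: the primary replicates to secondaries,
    # waits for a write quorum of acks, and stamps each committed
    # transaction with a commit sequence number (epoch, num).
    from dataclasses import dataclass, field
    from typing import List, Tuple

    CSN = Tuple[int, int]  # commit sequence number: (epoch, num)

    @dataclass
    class Replica:
        name: str
        log: List[Tuple[CSN, str]] = field(default_factory=list)

        def write(self, csn: CSN, value: str) -> bool:
            self.log.append((csn, value))  # secondaries commit first
            return True                    # ack back to the primary

    @dataclass
    class Primary(Replica):
        secondaries: List[Replica] = field(default_factory=list)
        epoch: int = 1
        num: int = 0

        def replicate(self, value: str) -> CSN:
            self.num += 1
            csn = (self.epoch, self.num)
            quorum = len(self.secondaries) // 2 + 1  # majority (assumption)
            acks = sum(s.write(csn, value) for s in self.secondaries)
            if acks < quorum:
                raise RuntimeError("write quorum not reached")
            self.write(csn, value)  # primary commits after the quorum acks
            return csn

    p = Primary("P", secondaries=[Replica("S1"), Replica("S2"), Replica("S3")])
    print(p.replicate("row v1"))  # (1, 1)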
Reconfiguration
• Types of reconfiguration
– Primary failover
– Removing a failed secondary
– Adding a recovered replica
– Building a new secondary
• Assumes
– Failure detector
– Leader election
– Both services provided by the Fabric layer
• Safe in the presence of cascading failures
[Figure: a failed primary (P) is replaced by promoting a secondary (S); a failed secondary is removed and a new one built]
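A hedged sketch of the primary-failover case: once the Fabric layer's failure detector reports the primary dead and leader election picks a coordinator, promote the surviving secondary with the highest commit sequence number. The selection rule is an assumption consistent with the (epoch, num) numbering above, not quoted from the deck:

    # Sketch of primary failover: promote the surviving secondary with
    # the highest commit sequence number (epoch, num). The selection
    # rule is an assumption consistent with the slides, not quoted text.
    from typing import Dict, Optional, Tuple

    def failover(replicas: Dict[str, Tuple[int, int]],
                 failed: str) -> Optional[str]:
        """replicas maps replica name -> last committed (epoch, num)."""
        survivors = {n: csn for n, csn in replicas.items() if n != failed}
        if not survivors:
            return None  # nothing left to promote
        # Tuples compare lexicographically, so epoch dominates num.
        return max(survivors, key=survivors.get)

    # Example: P failed; S2 has seen the most recent commit.
    print(failover({"P": (1, 9), "S1": (1, 8), "S2": (1, 9)}, "P"))  # S2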
Partition Management
• Partition Manager (PM) is a highly available service running in the Master cluster
– Ensures all partitions are operational
– Places replicas across failure domains (rack/switch/server)
– Ensures all partitions have the target replica count
– Balances the load across all the nodes
• Each node manages multiple partitions
• Global state maintained by the PM can be recreated from local node state in the event of disaster (GPM rebuild); see the sketch below
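The GPM rebuild is possible because the global map is just an aggregate of facts every data node already knows locally. A minimal sketch of that reconstruction; the node report format is invented for illustration:

    # Sketch of a GPM rebuild: recreate the global partition map purely
    # from node-local reports. The report format here is illustrative.
    from collections import defaultdict
    from typing import Dict, List, Tuple

    # Each node reports (partition_id, role) for the replicas it hosts.
    NodeReport = List[Tuple[str, str]]

    def rebuild_gpm(reports: Dict[str, NodeReport]) -> Dict[str, Dict[str, str]]:
        gpm: Dict[str, Dict[str, str]] = defaultdict(dict)
        for node, replicas in reports.items():
            for partition, role in replicas:
                gpm[partition][node] = role
        return dict(gpm)

    reports = {
        "node100": [("p1", "P"), ("p2", "S")],
        "node101": [("p1", "S"), ("p2", "P")],
        "node102": [("p1", "S"), ("p2", "S")],
    }
    print(rebuild_gpm(reports))
    # {'p1': {'node100': 'P', 'node101': 'S', 'node102': 'S'}, 'p2': ...}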
System in Operation
[Figure: the Leader Elector and the Fabric coordinate Data Nodes 100–105; each node hosts a mix of primary (P) and secondary (S) replicas for many partitions]
Recap
• Two kinds of nodes:
– Data nodes store application data
– Master nodes store cluster metadata
• Node failures are reliably detected
– On every node, the SQL and Fabric processes monitor each other (see the sketch below)
– Fabric processes monitor each other across nodes
• Local failures cause nodes to fail fast
• Failures cause reconfiguration and placement changes
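A rough sketch of the on-node mutual monitoring: if the peer process misses its heartbeat deadline, fail fast rather than limp along. The timeout and mechanism are illustrative assumptions:

    # Rough sketch of on-node mutual monitoring: if the peer process
    # misses its heartbeat deadline, fail fast. Timeout is illustrative.
    import sys
    import time

    HEARTBEAT_TIMEOUT = 5.0  # seconds (assumption)

    class PeerMonitor:
        def __init__(self, peer_name: str) -> None:
            self.peer_name = peer_name
            self.last_beat = time.monotonic()

        def heartbeat(self) -> None:
            self.last_beat = time.monotonic()

        def check(self) -> None:
            if time.monotonic() - self.last_beat > HEARTBEAT_TIMEOUT:
                # Local failure detected: fail fast, let the Fabric
                # reconfigure, rather than serve from a sick node.
                sys.exit(f"{self.peer_name} unresponsive; failing fast")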
DEPLOYMENT
Hardware Architecture
• Each rack hosts 2 pods of 20 machines each
• Each pod has a TOR mini-switch with a 10Gb uplink to the L2 switch
• Each SQL Azure machine runs on a commodity box, for example:
– 8 cores
– 32 GB RAM
– 1TB+ SATA drives
– Programmable power
– 1Gb NIC
• Machine spec changes as hardware (pricing) evolves
Hardware Challenges
• SATA drives
– On-disk cache and lack of true "write through" result in Write-Ahead Logging violations
• The DB requires in-order writes to be honored
• Can force a cache flush, but that causes performance degradation
– Disk failures happen daily (at scale), fail-fast on those
• Bit-flips (enabled page checksums)
• Drives just disappear
• IOs are misdirected
• Faulty NIC
– Encountered message corruption
• Enabled message signing and checksums
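Since the deck mentions enabling page checksums and message signing to catch bit-flips and NIC corruption, here is a hedged sketch of the verify-on-read pattern; CRC32 stands in for whatever checksum the real system uses:

    # Sketch of verify-on-read: store a checksum with each page and
    # fail fast on mismatch. CRC32 is a stand-in; the real system's
    # checksum algorithm isn't specified in the deck.
    import sys
    import zlib

    def write_page(payload: bytes) -> bytes:
        crc = zlib.crc32(payload).to_bytes(4, "big")
        return crc + payload

    def read_page(page: bytes) -> bytes:
        crc, payload = page[:4], page[4:]
        if zlib.crc32(payload).to_bytes(4, "big") != crc:
            # Corruption (bit-flip, misdirected IO): fail fast.
            sys.exit("page checksum mismatch; failing fast")
        return payload

    page = write_page(b"hello")
    assert read_page(page) == b"hello"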
Software Deployment
• The OS is automatically imaged via deployment
• All services are set up using file copy
– Guarantees on which version is running
– Provides a fast switch to a new version
– Minimal global state allows running side by side
– Yes, that includes the SQL Server DB engine
• Rollout is monitored to ensure high availability
– Knowledge of replica state health ensures the SLA is met
– Two-phase rollouts for data or protocol changes (sketched below)
• Leverages internal Autopilot technologies with SQL Azure extensions
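The two-phase rollout can be pictured as "ship code that understands both versions first; flip the wire/disk format only once every node has it". A hedged sketch of that gate; version numbers are illustrative:

    # Sketch of a two-phase rollout: phase 1 ships code that can read
    # both formats; phase 2 flips writers to the new format only after
    # every node runs phase-1 code. Version numbers are illustrative.
    from typing import Dict

    def safe_to_enable_v2(node_versions: Dict[str, int]) -> bool:
        # Phase 2 may start only when all nodes understand v2.
        return all(v >= 2 for v in node_versions.values())

    cluster = {"node100": 2, "node101": 2, "node102": 1}
    print(safe_to_enable_v2(cluster))   # False: node102 would break
    cluster["node102"] = 2
    print(safe_to_enable_v2(cluster))   # True: flip writers to v2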
Software Challenges
• Lack of real-time OS features
– CPU priority
• High priority for Fabric lease traffic
– Page Faults/GC
• Locked pages for SQL and Fabric (in managed code)
• Fail fast or not?
– Yes, for corruption/AV
– No, for other issues unless centrally controlled
• What is really considered failed?
– Some failures are non-deterministic or manifest as hangs
– Multiple protocols / channels mean partial failures too
Monitoring
• Health model with repair actions
– Reboot → Re-deploy → Re-image (OS) → RMA cycle (see the sketch below)
• Additional monitoring for the SQL tier
– Connect / network probes
– Memory leaks / hung worker processes
– Database corruption detection
– Trace and performance stats capture
• Sourced from regular SQL trace and support mechanisms
• Stored locally and pushed to a global cluster-wide store
• The global store is used for service insight and problem tracking
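The repair cycle escalates through increasingly drastic actions. A small sketch of that ladder; the one-attempt-per-rung policy is an assumption for illustration:

    # Sketch of the escalating repair ladder from the health model:
    # Reboot -> Re-deploy -> Re-image (OS) -> RMA. The policy of one
    # attempt per rung is invented for illustration.
    REPAIR_LADDER = ["reboot", "redeploy", "reimage", "rma"]

    def next_repair(history: list) -> str:
        """Pick the next action given the repairs already tried."""
        for action in REPAIR_LADDER:
            if action not in history:
                return action
        return "rma"  # terminal: hand the machine back to the vendor

    print(next_repair([]))                      # reboot
    print(next_repair(["reboot", "redeploy"]))  # reimage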
LESSONS LEARNED
How is Cloud Different?
Minor differences:
• Cheap hardware
– No SANs, no SCSI, no Infiniband
– Iffy routers, network cards
– Relatively homogeneous
– Hardware not selected for the purpose
• Lots of it
– Not one machine, not 10 machines – think 1000+
• Public internet
– High latencies, sometimes
– All over the world
– Scary people (untrusted) lurking in the shadows
How is Cloud Different?
Real differences:
Design for Failure
[Figure: a feedback loop split between local and centralized halves. Locally, nodes observe and detect, collect context, and send complaints; centrally, the system aggregates complaints, makes decisions, and commits them; nodes then implement those decisions]
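A hedged sketch of the centralized half of that loop: complaints from many observers are aggregated, and action is taken only when enough independent reporters agree. The threshold is an invented parameter:

    # Sketch of the observe/complain/decide loop from the diagram:
    # local nodes detect and complain; a central service aggregates
    # complaints and decides. The vote threshold is an assumption.
    from collections import Counter

    COMPLAINT_THRESHOLD = 2  # independent reporters needed before acting

    def aggregate_and_decide(complaints):
        """complaints: iterable of (reporter, suspect) pairs."""
        votes = Counter(suspect for _, suspect in complaints)
        return [suspect for suspect, n in votes.items()
                if n >= COMPLAINT_THRESHOLD]

    complaints = [("node100", "node103"), ("node101", "node103"),
                  ("node102", "node104")]
    print(aggregate_and_decide(complaints))  # ['node103']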
Design for Mediocre
• The network is neither fast nor slow: it varies
– Design for huge latency variance
– Machine independence is key
Design for (appropriate) Simplicity
• There’s no such thing as a “repro”
– Everything must be debuggable from logs (and dumps)
– This is much harder than it sounds – takes time to log the right stuff
• System state must be externally examinable
– Not locked in internal data structures
• Fail-fast
– Is great! Very hard to reason about partial failures. We kill it fast.
– Is awful! Cascading failures can kill entire system if you are not careful
– Principle: If you are sure it’s local, kill it. If not, not so fast
• ‘No workflows’ is best
– Machine independence is a virtue
– Things that can safely be local, should be
• Single-level workflows are next best (reduce the number of moving parts); see the sketch below
– Resumable (not tied to a specific machine)
– Design with failure as the norm, using distributed (persisted) state machines
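A hedged sketch of the resumable single-level workflow idea: persist the workflow's next step before executing it, so any machine can pick it up after a crash. The step names and the JSON file store are invented for illustration:

    # Sketch of a resumable single-level workflow: the current step is
    # persisted before execution, so any machine can resume after a
    # crash. Step names and the JSON file store are illustrative.
    import json
    import os

    STEPS = ["copy_data", "verify_copy", "swap_primary"]
    STATE_FILE = "workflow_state.json"

    def load_state() -> int:
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)["next_step"]
        return 0

    def run_workflow() -> None:
        step = load_state()
        while step < len(STEPS):
            # Persist intent first: a crash here just re-runs this step,
            # so each step must be idempotent.
            with open(STATE_FILE, "w") as f:
                json.dump({"next_step": step}, f)
            print(f"executing {STEPS[step]}")
            step += 1
        with open(STATE_FILE, "w") as f:
            json.dump({"next_step": step}, f)  # mark workflow complete

    run_workflow()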
Design for many
• Many machines is great!
– Reduce focus on machine reliability
• By the time an RDBMS runs recovery, the world has moved on
– Distribution enables load-balancing
• Focus on elasticity and flexibility
– HA with 100 machines is better than with 2
• Load distribution, parallelism of copy
Design for multi-tenancy
• Customers like using many machines
– Enables load-balancing and elasticity
– But they don’t like paying for many machines
• Solution: multi-tenancy!
– Everyone gets many slices
• Hard!
– Isolation for security and performance
– Many small databases? Costs….
– Many relationships (replication)
– Tradeoffs: isolation vs. elasticity?
Local vs. Global
Balance between local and global is key!
Future Work and Challenges
• Performance SLAs
– Delivering on “guaranteed capacity” while consolidating diverse
workloads is hard
• Privacy, Governance and Compliance
– Perceptions and realities
– Private Cloud appliances
• Programming Models
– Support for loosely coupled scaleout patterns such as sharding
– Transparent multi-node scaleout
• Data Redundancy
– Point-in-time restore (backup knobs)
– Geo-availability for multiple points of presence
• Health Model for Applications
– Data tier is only part of the problem – support for hosting N-tier
apps and providing insight into health and performance
QUESTIONS?
SQL Azure Links
• SQL Azure
http://www.microsoft.com/windowsazure/sqlazure/