Getting Started With Distributed SQL

CONTENTS
• Shared Characteristics of Distributed SQL Databases
• Differences Between Distributed SQL Databases
• Cost Considerations
• Learn More
ANDREW OLIVER
SR. DIRECTOR OF PRODUCT MARKETING, MARIADB
Mission-critical applications have evolved from on-premises deployments with a few megabytes of data and users to a cloud-native infrastructure with far more data and users. Business expectations have changed from eight-hour-per-day uptime to 24/7 global availability with virtually no downtime/maintenance windows. The relational systems of yesteryear are not up to the task in terms of scalability, availability, resilience, or performance under load. NoSQL databases do not offer the robust functionality or transactional integrity required for systems of record.

Distributed SQL databases are distinct from some other types of non-traditional relational databases. For instance, Amazon Aurora allows only a single writer with many replicas or two writers (multi-master) with no additional replicas. Aurora relies on shared storage for reliability and scalability. The term "NewSQL" was previously used to be more inclusive of other types of databases, including in-memory databases like VoltDB. While keeping all or most data in memory can lead to lower latency and is good for specialized use cases, it is not cost-effective for applications at a greater level of scale. Some NewSQL databases are actually analytical stores.

Distributed SQL databases are designed to be general-purpose operational databases. Distributed SQL databases are most useful as operational stores where scale, availability, and disaster recovery requirements exceed the capabilities of a traditional relational database. For example, Samsung uses a distributed SQL database to store its customer information for their Samsung cloud service, a photo and information service similar to Apple's iCloud. ShortStack uses a distributed SQL database to handle their user data for running online contests. Example use cases include:
REFCARD | GETTING STARTED WITH DISTRIBUTED SQL
Figure 1
When a client reads from a distributed SQL database, the database
computes the hash and selects one or more nodes to surface the
requested data. Likewise, queries may also be similarly distributed
among multiple nodes in the database. Because data is distributed,
reads can pull from multiple storage devices at the same time. An
example is shown in Figure 3.
Figure 3
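The read path described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the function name `owner_nodes`, the node names, the choice of SHA-256, and the rule that replicas live on the next nodes in the list are all assumptions made for the example.

```python
import hashlib


def owner_nodes(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Return the nodes assumed to hold `key`.

    The primary is picked by hashing the key; the remaining replicas are
    simply the next nodes in the list (an assumption for this sketch).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replicas, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


# Hypothetical three-node cluster.
nodes = ["node-a", "node-b", "node-c"]
holders = owner_nodes("customer:1001", nodes)
print(holders)  # any of these nodes could serve a read for the key
```

Because the hash is deterministic, any client or proxy that knows the node list computes the same owners for a key, which is what allows reads to be served from more than one storage device.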
In order to ensure data is consistent when written or updated, the database uses a type of distributed transaction protocol similar to two-phase commit. Modern distributed SQL databases primarily use a consensus algorithm such as Paxos or Raft. These protocols coordinate membership in the cluster along with ensuring that data is written to the correct nodes in order to guarantee data consistency and reliability. Distributed SQL databases work best in the cloud if replicas are distributed among cloud availability zones (or different racks in private data centers). However, while some databases support distributing replicas among geographic regions, replication over large distances results in significant latency.

To address this, an eventually consistent latency-tolerant replication protocol is used across data centers (see Figure 4).
Figure 4
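Consensus protocols such as Raft and Paxos share one simple rule: a write counts as committed once a majority of the replicas has acknowledged it. The sketch below only illustrates that arithmetic. The function names are hypothetical, and it leaves out leader election, log replication, and everything else a real protocol needs.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return cluster_size // 2 + 1


def is_committed(acks: int, cluster_size: int) -> bool:
    """A write is treated as committed once a majority has acknowledged it."""
    return acks >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


for size in (3, 4, 5):
    print(size, majority(size), tolerated_failures(size))
```

One consequence of the majority rule is that a four-replica group tolerates no more failures than a three-replica group, which is why odd replica counts are the usual choice.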
Figure 5
Despite the similarity and intentional compatibility, there are often differences in how data is modeled compared to traditional relational databases. The most obvious is that sequences are highly discouraged because generating a sequence across a distributed cluster creates a bottleneck that hampers scalability and performance. Instead, natural keys or randomly generated unique keys are preferred.

GENERAL ARCHITECTURE
Distributed SQL databases are based on the same general architecture. Data is stored on multiple nodes. Writes are balanced between those nodes and assigned via a hashing algorithm, while reads are likewise balanced. Data is replicated to more than one node, so a distributed SQL database can survive the loss of one or more nodes. Writes and updates are handled via a distributed transaction that is coordinated among nodes. Some combination of client-side proxies or a load balancer directs traffic between database nodes.

ACID TRANSACTIONS
Unlike other distributed database technologies (e.g., NoSQL), distributed SQL databases are designed for systems of record. They supply transactional integrity and strong consistency from the ground up with coordinated writes, locked records, and other methods such as multi-version concurrency control.

SYNCHRONOUS REPLICATION
Distributed SQL databases use synchronous replication between nodes to ensure transactional integrity with continuous availability. When a write takes place, each node acknowledges the write. Other similar types of databases, like Amazon Aurora, use asynchronous replication, which could cause inconsistent writes between nodes.

QUERY DISTRIBUTION
Compared to client-server database technologies, distributed SQL database queries are replicated to any number of database nodes. Additionally, data can be pulled from multiple nodes and aggregated into a single result set. Some distributed SQL databases even distribute processing parts of complex queries (e.g., joins, subqueries) to different nodes.

DIFFERENCES BETWEEN DISTRIBUTED SQL DATABASES
While the basic architectural approach of distributed SQL databases is easily recognized and distinct from both NoSQL and traditional relational databases, there are some key differences between them.

DELIVERY (CLOUD/DBAAS, ON-PREMISES, OR HYBRID)
At this time, every distributed SQL database can be installed in the cloud; however, not all of them offer a fully managed database-as-a-service (DBaaS). Some distributed SQL databases are available in DBaaS formation, as a customer install, and even hybrid installations where the DBaaS can manage local instances and replicate between a private data center and a cloud installation, and vice versa.

COMPATIBILITY
Distributed SQL databases strive to be compatible with existing traditional RDBMSs. However, similar to the previous generation of relational databases, there are differences in dialects, data types, and extended functionality like procedural languages. Leading distributed SQL databases have varied approaches to address compatibility.

MariaDB Xpand, for example, offers two topologies: one that serves as a compatible storage engine for the existing MariaDB Enterprise Server, and the other as a "performance topology" that circumvents the front end. The "compatibility mode" offers the strongest compatibility with MySQL and MariaDB (along with extensions for Oracle's PL/SQL). The "performance topology" offers higher performance and scale and lower latency.

CockroachDB attempts to be wire compatible with PostgreSQL but reimplements the query engine to distribute processing, which is similar to Xpand. Yugabyte preserves the PostgreSQL front end and uses it for query processing in a way similar to MariaDB Xpand in compatibility mode.

For complex applications migrating to distributed SQL, an existing traditional RDBMS front end in compatibility mode may make the most sense, particularly if you're using extended features of a traditional database. However, if you're running in production over the long term, migrating to a performance topology is likely a better option than using an existing front end.

CONSENSUS ALGORITHM
In the early 2010s, NoSQL databases were widely popular for their scalability features. However, they relaxed transactional consistency and removed key database features, including joins. While adoption of NoSQL was swift for applications where scale and concurrency were the most important factors, most mission-critical applications that required transactional integrity remained in client-server databases like Oracle, MySQL, PostgreSQL, and SQL Server.

Meanwhile, ongoing research into the Paxos consensus algorithm and database design made higher-scale, transactionally correct relational databases possible. Unfortunately, Paxos is considered hard to implement. Other algorithms, including Calvin and Raft, were also developed. Calvin is not ideal for dynamic queries, which are common in SQL databases. Raft proved to be easier to implement and is used by most distributed SQL databases, except MariaDB Xpand and Google Spanner.
There is continued discussion and academic research into which algorithm is "better," but for the most part, the difference lies in the implementation details, which is not of great interest to most database developers and administrators. It should be noted that the application of this technology is what made distributed SQL databases possible.

Early distributed SQL implementations include the Clustrix database, originally available as an appliance, MySQL Cluster, and Google's Spanner. Spanner requires hardware atomic clocks in order to work. Most distributed SQL databases evolved clock synchronization and drift detection algorithms and no longer require hardware-based atomic clocks, which allows them to be deployed on general use hardware and cloud computing services.

COLUMNAR INDEXES/MIXED WORKLOAD SUPPORT
Distributed SQL databases are operational or transactional databases by nature. However, by adding columnar indexes, distributed SQL databases can handle real-time analytical queries. Consider the case of e-commerce: The majority of queries will be light reads and writes, but eventually, someone will want to report on the sales or types of customer engagements, or even offload summaries into a data warehouse. These are long-running analytical queries that may benefit from a columnar index. Most distributed SQL databases do not yet have this capability, but it can be expected to become more commonplace as developers look to consolidate and simplify their data architecture.
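The columnar index idea discussed above can be illustrated with plain data structures. This is a toy sketch under stated assumptions: the `orders` rows, the column names, and the `to_columns` helper are invented for the example, and a real columnar index adds compression, ordering, and distribution across nodes.

```python
# Row-oriented layout: convenient for light reads and writes of whole records.
orders = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "ben", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]


def to_columns(rows: list[dict]) -> dict[str, list]:
    """Pivot rows into one list per column (a stand-in for a columnar index)."""
    columns: dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(orders)

# Transactional access touches one complete row.
second_order = orders[1]

# An analytical query such as total sales scans a single column
# and never reads customer names or order IDs.
total_sales = sum(columns["amount"])
print(second_order, total_sales)
```

The design point is that a sales report reads one contiguous column instead of every field of every row, which is why long-running analytical queries may benefit from this layout.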
Figure 6
SCALABILITY
The distributed SQL architecture enables horizontal scalability;
however, implementation details have a large impact on production
reality. The key to scalability is how data is assigned to nodes and
how data is rebalanced over time. Additionally, load balancing plays
a central role in both scalability and performance.
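How data is assigned and rebalanced can be made concrete with consistent hashing, one well-known technique for limiting data movement when a node is added. It is used here purely as an illustration: the text does not say which scheme any particular database uses, and the `HashRing` class, node names, and virtual-node count are assumptions of this sketch.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Map a string to a position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # kept sorted by position
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node owns several points so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # The owner is the first ring point at or after the key's position,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(self._ring, (_position(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"row-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one node
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

Only keys that now belong to the new node change owners; the rest stay where they were. With naive modulo hashing, by contrast, most keys would move whenever the node count changes.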
load generation infrastructure is maxed rather than the system under test. It is equally essential to ensure the client network and other infrastructure between the load generator and system under test have sufficient capacity.

COST CONSIDERATIONS
Evaluating cost is more complex than simply reviewing licensing, cost per hour, or any other vendor-advertised measure. It is important to consider the entire cost of the system, including factors such as:
• Staff training
• Ongoing maintenance
• Risk of loss of service during a failure
• Downtime during upgrades
• Support and support quality
• IOPS for cloud services

LEARN MORE
Distributed SQL databases are one of the hottest new technologies in cloud computing. They offer transactional integrity without sacrificing scalability and are built for reliability in the cloud. This new technology makes it possible to bring applications that require a system of record to the cloud. The following resources provide additional information on distributed SQL databases:
• "Distributed SQL"
https://en.wikipedia.org/wiki/Distributed_SQL
• "What You Need to Know About Distributed SQL"
https://dzone.com/articles/what-you-need-to-know-about-distributed-sql

Andrew C. Oliver is the Senior Director of Product Marketing for MariaDB. He is a prolific writer about technology, particularly open-source and distributed database technologies. In the past, he served on the board of the Open Source Initiative, founded Apache POI, and was an early part of JBoss, Inc. before its acquisition by Red Hat. Find him over on Twitter @acoliver.

Copyright © 2021 Devada, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means of electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.