Getting Started With Distributed SQL


BROUGHT TO YOU IN PARTNERSHIP WITH

CONTENTS

•  About Distributed SQL
•  How Distributed SQL Works
•  Shared Characteristics of Distributed SQL Databases
•  Differences Between Distributed SQL Databases
•  Evaluating Distributed SQL Databases
•  Cost Considerations
•  Learn More

ANDREW OLIVER
SR. DIRECTOR OF PRODUCT MARKETING, MARIADB

Mission-critical applications have evolved from on-premises deployments with a few megabytes of data and users to a cloud-native infrastructure with far more data and users. Business expectations have changed from eight-hour-per-day uptime to 24/7 global availability with virtually no downtime/maintenance windows. The relational systems of yesteryear are not up to the task in terms of scalability, availability, resilience, or performance under load. NoSQL databases do not offer the robust functionality or transactional integrity required for systems of record.

ABOUT DISTRIBUTED SQL
A distributed SQL database is a relational database that distributes data and processing across multiple servers, containers, or virtual machines (VMs). They offer the same ACID guarantees of traditional relational database management systems (RDBMSs) along with the scale and availability of a distributed database. Compared to traditional relational databases, they offer greater scale and reliability, and compared to NoSQL databases, they offer more robust functionality and consistency. Inherent to distributed SQL databases is the use of SQL as a query language.

Distributed SQL databases are distinct from some other types of non-traditional relational databases. For instance, Amazon Aurora allows only a single writer with many replicas or two writers (multi-master) with no additional replicas. Aurora relies on shared storage for reliability and scalability. The term "NewSQL" was previously used to be more inclusive of other types of databases, including in-memory databases like VoltDB. While keeping all or most data in memory can lead to lower latency and is good for specialized use cases, it is not cost-effective for applications at a greater level of scale. Some NewSQL databases are actually analytical stores.

Distributed SQL databases are designed to be general-purpose operational databases. Distributed SQL databases are most useful as operational stores where scale, availability, and disaster recovery requirements exceed the capabilities of a traditional relational database. For example, Samsung uses a distributed SQL database to store its customer information for their Samsung cloud service, a photo and information service similar to Apple's iCloud. ShortStack uses a distributed SQL database to handle their user data for running online contests. Example use cases include:

•  E-commerce data: User interaction, transaction, product
•  Financial services: Trade and transaction, fraud prevention, customer and account information
•  General business: Supply chain, inventory, financial, customer and account information

REFCARD | GETTING STARTED WITH DISTRIBUTED SQL

HOW DISTRIBUTED SQL WORKS

Distributed SQL databases use a hashing algorithm to assign writes to different units called partitions (or slices in some databases). Figures 1 and 2 show how those partitions are distributed among multiple compute nodes such as VMs, containers, or physical hardware. Each partition is replicated to at least two nodes (generally more).

Figure 1

Figure 2

When a client reads from a distributed SQL database, the database
computes the hash and selects one or more nodes to surface the
requested data. Likewise, queries may also be similarly distributed
among multiple nodes in the database. Because data is distributed,
reads can pull from multiple storage devices at the same time. An
example is shown in Figure 3.

Figure 3
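
The routing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the partition count, the replication factor of two, and the round-robin placement rule are all hypothetical, and real products use their own hashing and placement schemes.

```python
import hashlib
from typing import List

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
NUM_PARTITIONS = 16      # assumed number of partitions (slices)
REPLICATION_FACTOR = 2   # assumed; the text says at least two copies


def partition_for(key: str) -> int:
    """Map a row key to a partition using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def replicas_for(partition: int) -> List[str]:
    """Place a partition on REPLICATION_FACTOR distinct nodes (round-robin)."""
    return [NODES[(partition + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


def route(key: str) -> List[str]:
    """Return the nodes that hold the partition for this key."""
    return replicas_for(partition_for(key))
```

Because the hash is deterministic, a write and a later read of the same key resolve to the same partition, and therefore to the same set of nodes, without any central lookup.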

In order to ensure data is consistent when written or updated, the database uses a type of distributed transaction protocol similar to two-phase commit. Modern distributed SQL databases primarily use a consensus algorithm such as Paxos or Raft. These protocols coordinate membership in the cluster along with ensuring that data is written to the correct nodes in order to guarantee data consistency and reliability. Distributed SQL databases work best in the cloud if replicas are distributed among cloud availability zones (or different racks in private data centers). However, while some databases support distributing replicas among geographic regions, replication over large distances results in significant latency.

To address this, an eventually consistent latency-tolerant replication protocol is used across data centers (see Figure 4).


Figure 4
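
For readers unfamiliar with the two-phase commit pattern mentioned above, the following Python sketch shows its basic shape. It is a textbook simplification, not the protocol of any particular product: the `Participant` class and `two_phase_write` function are hypothetical names, and the sketch ignores timeouts, coordinator failure, and durable logging, which real systems (and consensus algorithms such as Paxos or Raft) must handle.

```python
class Participant:
    """A hypothetical replica node that can stage a write and then apply it."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.staged = {}  # writes prepared but not yet visible
        self.data = {}    # committed data

    def prepare(self, txid, key, value):
        # Phase 1: stage the write and vote yes only if this node can commit.
        if not self.healthy:
            return False
        self.staged[txid] = (key, value)
        return True

    def commit(self, txid):
        # Phase 2: make the staged write visible.
        key, value = self.staged.pop(txid)
        self.data[key] = value

    def abort(self, txid):
        # Discard any write staged for this transaction.
        self.staged.pop(txid, None)


def two_phase_write(participants, txid, key, value):
    """Apply the write on every participant, or on none of them."""
    votes = [p.prepare(txid, key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit(txid)
        return True
    for p in participants:
        p.abort(txid)
    return False
```

The point of the two phases is that no node makes the write visible until every node has confirmed it can, which is what keeps replicas from diverging when one of them fails mid-write.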

SHARED CHARACTERISTICS OF DISTRIBUTED SQL DATABASES
While no two distributed SQL database products are exactly alike, they do have shared characteristics that distinguish them from other types of databases. First and foremost, distributed SQL databases are operational stores as opposed to analytical stores.

Though some distributed SQL databases are combined with analytical stores, that functionality is outside of distributed SQL itself, similar to how some traditional relational databases supply full-text search.

RELATIONAL MODEL
Distributed SQL databases use a relational model, in which:

•  Data is represented in tables, rows, and columns
•  Records are rows and fields are columns
•  A unique identifier, called a primary key, identifies each row
•  Shared values, called foreign keys, join data between tables

As with some traditional relational databases, the underlying storage may be substantially different than what is represented (see Figure 5).

Figure 5


Despite the similarity and intentional compatibility, there are often differences in how data is modeled compared to traditional relational databases. The most obvious is that sequences are highly discouraged because generating a sequence across a distributed cluster creates a bottleneck that hampers scalability and performance. Instead, natural keys or randomly generated unique keys are preferred.

GENERAL ARCHITECTURE
Distributed SQL databases are based on the same general architecture. Data is stored on multiple nodes. Writes are balanced between those nodes and assigned via a hashing algorithm, while reads are likewise balanced. Data is replicated to more than one node, so a distributed SQL database can survive the loss of one or more nodes. Writes and updates are handled via a distributed transaction that is coordinated among nodes. Some combination of client-side proxies or a load balancer directs traffic between database nodes.

ACID TRANSACTIONS
Unlike other distributed database technologies (e.g., NoSQL), distributed SQL databases are designed for systems of record. They supply transactional integrity and strong consistency from the ground up with coordinated writes, locked records, and other methods such as multi-version concurrency control.

SYNCHRONOUS REPLICATION
Distributed SQL databases use synchronous replication between nodes to ensure transactional integrity with continuous availability. When a write takes place, each node acknowledges the write. Other similar types of databases, like Amazon Aurora, use asynchronous replication, which could cause inconsistent writes between nodes.

QUERY DISTRIBUTION
Compared to client-server database technologies, distributed SQL database queries are replicated to any number of database nodes. Additionally, data can be pulled from multiple nodes and aggregated into a single result set. Some distributed SQL databases even distribute the processing of parts of complex queries (e.g., joins, subqueries) to different nodes.

DIFFERENCES BETWEEN DISTRIBUTED SQL DATABASES
While the basic architectural approach of distributed SQL databases is easily recognized and distinct from both NoSQL and traditional relational databases, there are some key differences between them.

DELIVERY (CLOUD/DBAAS, ON-PREMISES, OR HYBRID)
At this time, every distributed SQL database can be installed in the cloud; however, not all of them offer a fully managed database-as-a-service (DBaaS). Some distributed SQL databases are available in DBaaS form, as a customer install, and even as hybrid installations where the DBaaS can manage local instances and replicate between a private data center and a cloud installation, and vice versa.

COMPATIBILITY
Distributed SQL databases strive to be compatible with existing traditional RDBMSs. However, similar to the previous generation of relational databases, there are differences in dialects, data types, and extended functionality like procedural languages. Leading distributed SQL databases have varied approaches to address compatibility.

MariaDB Xpand, for example, offers two topologies: one that serves as a compatible storage engine for the existing MariaDB Enterprise Server, and the other as a "performance topology" that circumvents the front end. The "compatibility mode" offers the strongest compatibility with MySQL and MariaDB (along with extensions for Oracle's PL/SQL). The "performance topology" offers higher performance and scale and lower latency.

CockroachDB attempts to be wire compatible with PostgreSQL but reimplements the query engine to distribute processing, which is similar to Xpand. Yugabyte preserves the PostgreSQL front end and uses it for query processing in a way similar to MariaDB Xpand in compatibility mode.

For complex applications migrating to distributed SQL, an existing traditional RDBMS front end in compatibility mode may make the most sense, particularly if you're using extended features of a traditional database. However, if you're running in production over the long term, migrating to a performance topology is likely a better option than using an existing front end.

CONSENSUS ALGORITHM
In the early 2010s, NoSQL databases were widely popular for their scalability features. However, they relaxed transactional consistency and removed key database features, including joins. While adoption of NoSQL was swift for applications where scale and concurrency were the most important factors, most mission-critical applications that required transactional integrity remained in client-server databases like Oracle, MySQL, PostgreSQL, and SQL Server.

Meanwhile, ongoing research into the Paxos consensus algorithm and database design made higher-scale, transactionally correct relational databases possible. Unfortunately, Paxos is considered hard to implement. Other algorithms, including Calvin and Raft, were also developed. Calvin is not ideal for dynamic queries, which are common in SQL databases. Raft proved to be easier to implement and is used by most distributed SQL databases, except MariaDB Xpand and Google Spanner.
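
In their common forms, Paxos and Raft both make progress only when a majority of replicas agree. The small Python sketch below works through the resulting arithmetic; the function names are hypothetical and it says nothing about how any specific database sizes its replica groups.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - quorum_size(replicas)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# prints:
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that going from three replicas to four raises the quorum without tolerating any additional failure, which is one reason odd replica counts are usually chosen for majority-based protocols.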
a-service (DBaaS). Some distributed SQL databases are available in


There is continued discussion and academic research into which algorithm is "better," but for the most part, the difference lies in the implementation details, which are not of great interest to most database developers and administrators. It should be noted that the application of this technology is what made distributed SQL databases possible.

Early distributed SQL implementations include the Clustrix database, originally available as an appliance, MySQL Cluster, and Google's Spanner. Spanner requires hardware atomic clocks in order to work. Most distributed SQL databases evolved clock synchronization and drift detection algorithms and no longer require hardware-based atomic clocks, which allows them to be deployed on general use hardware and cloud computing services.

SCALABILITY
The distributed SQL architecture enables horizontal scalability; however, implementation details have a large impact on production reality. The key to scalability is how data is assigned to nodes and how data is rebalanced over time. Additionally, load balancing plays a central role in both scalability and performance.

Some databases rely on the client to "know" which node to address. Others require traditional IP load balancers or use more sophisticated database proxies that understand more about the underlying database.

FAULT RECOVERY
All distributed SQL databases are largely fault-tolerant. However, they differ in what happens during a fault. Does the client have to retry the failed transactions, or can they be recovered and replayed? How long does it take for the database to rebalance data between nodes in the event one is lost?

KUBERNETES
The major distributed SQL implementations support Kubernetes, but implementation and performance vary between them based on how IOPS are handled. While some allow bare-metal installations, self-healing and other functionality are limited or lost when running without Kubernetes.

MULTIMODAL
Strictly speaking, multimodal functionality is not a distributed SQL function but is based on whether ancillary processing or data storage types are provided with the database and how consistency guarantees apply to that functionality. Examples include column storage, analytics, and document storage. If a distributed SQL database provides these additional features, it's possible to combine real-time analytics along with operational capabilities.

COLUMNAR INDEXES/MIXED WORKLOAD SUPPORT
Distributed SQL databases are operational or transactional databases by nature. However, by adding columnar indexes, distributed SQL databases can handle real-time analytical queries. Consider the case of e-commerce: The majority of queries will be light reads and writes, but eventually, someone will want to report on the sales or types of customer engagements, or even offload summaries into a data warehouse. These are long-running analytical queries that may benefit from a columnar index. Most distributed SQL databases do not yet have this capability, but it can be expected to become more commonplace as developers look to consolidate and simplify their data architecture.

Figure 6

EVALUATING DISTRIBUTED SQL DATABASES
The most important aspect of designing a proof of concept (PoC) is to focus on data and queries that closely match your actual application. There is a temptation to test the platform's limits with unrealistic queries (e.g., 15 joins with six tables that pull back 1M rows or a single row point query) and measure the performance between different systems. Database technologies make trade-offs and optimize for particular usage patterns. In the case of distributed SQL, the database optimizes for throughput of transactional volume.

In designing a PoC, actual production data and application traffic are optimal. Second best is a simulation that closely matches the general pattern in terms of table structure, query complexity, and proportion of reads and writes. It is important to set goals beyond a single factor such as pure database latency and focus on overall application performance at nominal and peak usage.

This means that if at nominal use, a traditional database offers 1ms latency but 1,000ms at peak usage, and the application performs at 4s but has a performance goal of 3s, it is not meeting the objective. If a distributed SQL database performs at 15ms under nominal usage but performs at 20ms at peak usage, and the application meets its 3s goal, it has met the requirement. In generating load, it is essential to ensure that the infrastructure can generate sufficient load to test the database system capacity at the intended performance goal.


For instance, if observed latency increases significantly at 1,000 transactions per second, but overall resource utilization of disk, CPU, and network does not appear to be bottlenecked, it may be that the load generation infrastructure is maxed out rather than the system under test. It is equally essential to ensure the client network and other infrastructure between the load generator and system under test have sufficient capacity.

COST CONSIDERATIONS
Evaluating cost is more complex than simply reviewing licensing, cost per hour, or any other vendor-advertised measure. It is important to consider the entire cost of the system, including factors such as:

•  Staff training
•  Ongoing maintenance
•  Risk of loss of service during a failure
•  Downtime during upgrades
•  Support and support quality
•  IOPS for cloud services

LEARN MORE
Distributed SQL databases are one of the hottest new technologies in cloud computing. They offer transactional integrity without sacrificing scalability and are built for reliability in the cloud. This new technology makes it possible to bring applications that require a system of record to the cloud. The following resources provide additional information on distributed SQL databases:

•  "Distributed SQL"
   https://en.wikipedia.org/wiki/Distributed_SQL
•  "What You Need to Know About Distributed SQL"
   https://dzone.com/articles/what-you-need-to-know-about-distributed-sql

WRITTEN BY ANDREW OLIVER,
SR. DIRECTOR OF PRODUCT MARKETING, MARIADB

Andrew C. Oliver is the Senior Director of Product Marketing for MariaDB. He is a prolific writer about technology, particularly open-source and distributed database technologies. In the past, he served on the board of the Open Source Initiative, founded Apache POI, and was an early part of JBoss, Inc. before its acquisition by Red Hat. Find him over on Twitter @acoliver.

DZone, a Devada Media Property, is the resource software developers, engineers, and architects turn to time and again to learn new skills, solve software development problems, and share their expertise. Every day, hundreds of thousands of developers come to DZone to read about the latest technologies, methodologies, and best practices. That makes DZone the ideal place for developer marketers to build product and brand awareness and drive sales. DZone clients include some of the most innovative technology and tech-enabled companies in the world including Red Hat, Cloud Elements, Sensu, and Sauce Labs.

Devada, Inc.
600 Park Offices Drive
Suite 300
Research Triangle Park, NC 27709
888.678.0399 | 919.678.0300

Copyright © 2021 Devada, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means of electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
