
PERISCOPE DATA PRESENTS

The Analyst’s Guide to


Amazon Redshift
Periscope and Redshift
At Periscope Data we’ve tried all kinds of databases in search of speed, from custom hardware to cloud storage. Despite everything we tried, Amazon Redshift won out each time. We even built our Cache on it. Suffice it to say, we have a lot of experience with it and have developed expert opinions around optimal setups. There are two aspects of Redshift we really appreciate: columnar storage and distributed architecture.

Columnar Storage
Many databases store data by row, which requires you to read a whole table to sum a column.

Redshift stores its data by column. Since columns are stored separately, Redshift can ignore columns that aren’t referenced by the current query. The less data you need to read from disk, the faster your query will run.

Distributed Architecture
Redshift stores each table’s data in thousands of chunks called blocks. Each block can be read in parallel. This allows the Redshift cluster to use all of its resources on a single query.

When reading from disk, a Redshift cluster can achieve much higher input/output operations per second (IOPS), since each node reads from a different disk and the IOPS sum across the cluster.

In this guide, we’ll walk you through the basics of Redshift and offer step-by-step instructions on how to get started. Let’s kick things off with setting up a cluster.

Getting Started With Redshift: Cluster Configuration
The first step to using Redshift is to set up your cluster. The most important choices you’ll make during cluster setup are the type of nodes to use—we recommend Dense Compute—and the network security settings for your cluster.

Your security settings will determine how to connect to your cluster once it’s running. If you choose to make Redshift publicly accessible, you’ll need to whitelist IPs in your cluster’s network security group. If your cluster has a private IP in a VPC, you’ll need to set up and connect through a bastion host.

Setting up Your Cluster
Setting up a Redshift cluster is easy! The details of connecting to your Redshift cluster vary depending on how you set it up, but the basics are the same.

Node Type
First, decide what type of node you’ll use—Dense Compute or Dense Storage. Compute nodes have more ECU and memory per dollar than storage nodes but come with far less storage. Speed is of the utmost importance at Periscope Data, so we’ve found these to be the most effective. The more data you’re querying, the more compute you need to keep queries fast. Storage nodes can work well if you have too much data to fit on SSD nodes within your budget, or if you want to store a lot more data than you expect to query.

Number of Nodes
Now you’ll need to figure out how many nodes to use. This depends somewhat on your dataset, but for single-query performance, the more the merrier.

The size of your data will determine the smallest cluster you can have. Compute nodes only come with 160GB drives. Even if your row count is in the low billions, you may still require 10+ nodes.

Network Setup
The last step is network setup. Clusters in US East (N. Virginia) do not require a VPC, but the rest do. For any production usage, we suggest using a VPC, as you’ll get better network connectivity to your EC2 instances.

A default VPC will be created if one doesn’t exist. If you want to access Redshift from outside of AWS, add a public IP by setting Publicly Accessible to true. Whether you want a public IP on your cluster is up to you. We’ll cover both public and private IPs in this guide.

In either case, take note of the VPC security group. You’ll need to allow access to the cluster through it later.

EC2 Classic Setup
We’ll start with the simplest cluster setup possible—a cluster in Virginia not in any VPC. This kind of setup is best used for prototyping. Once the cluster boots, the Configuration tab in the AWS Redshift console will show you the endpoint address.
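Once you have that endpoint, connecting is mostly a matter of splitting it into psql arguments. Here is a minimal sketch, assuming the console shows the endpoint as host:port; the helper name and the example user and database are our own illustration:

```python
def psql_args(endpoint, user="periscope", dbname="dev"):
    """Split a console endpoint like "host:port" into a psql
    argument list. Falls back to 5439, Redshift's usual port."""
    host, _, port = endpoint.partition(":")
    return ["psql", "-h", host, "-p", port or "5439", "-U", user, dbname]

print(" ".join(psql_args(
    "periscope-test.us-east-1.redshift.amazonaws.com:5439")))
```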

periscopedata.com | hello@periscopedata.com | @periscopedata 201708


Before connecting, we need to allow the IP in the Cluster Security Groups. Click the link, then click Add Connection Type. The default will be your current IP.

Now connect directly to your cluster:

psql -h \
periscope-test.us-east-1.redshift.amazonaws.com \
-p 5439 -U periscope dev

And we’re in!

VPC: Public IP
If your cluster is in a VPC with a public IP (whether it’s an Elastic IP or not), there’s one more step: Head to the VPC’s security group for this cluster, and whitelist port 5439 for your IP address.

VPC: Private IP
In a VPC with a private IP setup, only connections made from inside the network are allowed. This is the most common setup we’ve seen.

There are a few ways to connect to these clusters. The easiest is to SSH into a box in the VPC with a public IP address—often called a Bastion host—and run psql there. The SSH program creates an encrypted connection, which lets you run commands and forward network data on remote machines.

To use graphical tools or a client-side copy command, you’ll need a way to forward traffic through the Bastion host. If you already have a VPN running in AWS, you can connect through there—just make sure the VPN instance is in a security group that can talk to the Redshift cluster. Otherwise, you can create your own instance to forward through. Either Windows or Linux will work as your Bastion host, but Linux will be much easier to set up.

Linux SSH Server
First, launch a Linux distro—we recommend Ubuntu if you don’t have a favorite. You can use a T2 Micro for this, though we recommend a T2 Medium for serious querying.

Log in by clicking the Connect button in the AWS console and following these instructions:

• Create a user account. -m creates a home directory, which you’ll need to store your public key for connecting.

$ sudo useradd periscope -m -s /bin/false

• Become the user to install your public key. -s sets a shell that quits, so the user can forward ports, but not run commands.

$ sudo su -s /bin/bash periscope
$ mkdir ~/.ssh
$ cat - >> ~/.ssh/authorized_keys

• Paste your public key, press Enter, then press Ctrl-d. Alternatively, you can copy the file there.

• Permissions are very important for the authorized_keys file. Its contents allow someone to connect to this machine as your user, so it’s only valid if editing is restricted to your user. Make sure that only your user has write permissions to your home directory and .ssh folder. For good measure, remove all group and other permissions from the authorized_keys file.

$ chmod 600 ~/.ssh/authorized_keys
$ chmod go-w ~/ ~/.ssh
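Those permission requirements are easy to get wrong and annoying to debug, since sshd will refuse keys whose files are too permissive. A small sketch of the same checks in Python—the function name is ours, and it assumes a POSIX filesystem:

```python
import os
import stat
import tempfile

def key_setup_ok(home):
    """Mirror the chmod rules above: no group/other write on the
    home directory and ~/.ssh, and authorized_keys set to 0600."""
    ssh_dir = os.path.join(home, ".ssh")
    keys = os.path.join(ssh_dir, "authorized_keys")
    for path in (home, ssh_dir):
        if os.stat(path).st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            return False  # group or other can write: sshd will balk
    return stat.S_IMODE(os.stat(keys).st_mode) == 0o600

# Throwaway home directory laid out like the steps above.
home = tempfile.mkdtemp()
os.mkdir(os.path.join(home, ".ssh"), 0o700)
keys = os.path.join(home, ".ssh", "authorized_keys")
open(keys, "w").close()
os.chmod(keys, 0o600)
print(key_setup_ok(home))  # → True
```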



If you want to lock down only that tunnel, you can in the authorized_keys file:

no-pty,permitopen="foo.us-east-1.amazonaws.com:5439" ssh-rsa AAAAB3NzaC1y...Rdo/R user@clientbox

no-pty is a step beyond using /sbin/false as the shell—it restricts the user from even opening a virtual terminal, so you have to connect with -nTN.

For more information and help with troubleshooting, visit the Ubuntu community site.

Windows SSH Server
Windows doesn’t come with an SSH server pre-installed. We recommend using freeSSHd—it’s free and easier to set up than OpenSSHd.

• In the Server Status tab, make sure the SSH server is running. In the Users tab, create a user. Set Authorization to Public Key and make sure to allow Tunneling.
• In the Tunneling tab, enable Local Port Forwarding. In the Authentication tab, set Public key authentication to Required, then open the public key folder.
• Copy your public key to a file with the same name as the user. The name has to match exactly, so take out any file extension.
• Make sure the public key is in the correct folder and has the correct name. You may also need to restrict it to administrator only. If your changes don’t seem to be taking effect, make sure you’re running as an administrator.

SSH Client
SSH has an option called local port forwarding, which causes your SSH client to open a port on your computer and forward any network traffic it receives to the server. In this case, the server forwards that connection to the database.

Mac/Linux
• On Mac/Linux, invoke local port forwarding with -L localport:redshift-host:redshift-port. You can choose any port greater than 1024; for this example we chose 5439. -nNT stops the SSH client from opening a shell and allows you to background the connection with an & at the end.

$ ssh bastion.us-east-1.amazonaws.com \
-L5439:foo.us-east-1.amazonaws.com:5439 -nNT

• After the connection starts working, connect using localhost as the hostname and 5439 as the port.

• Using psql, type the following:

psql -h 127.0.0.1 -p 5439 database_name

• If you’re using SQL Workbench/J instead, use the same host and port in its connection settings.

For more details on port forwarding—and cool tricks like the reverse tunnel—check the Ubuntu wiki.
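What -L actually does is run a small relay: accept connections on a local port and copy bytes both ways to the remote end. As a toy illustration only—our own sketch, with no encryption and no bastion hop, so not a substitute for ssh—here is the shape of that relay:

```python
import socket
import threading

def pipe(src, dst):
    # Copy bytes one way until the sender closes its end.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def forward(local_port, remote_host, remote_port):
    """Listen on 127.0.0.1:local_port and relay each connection to
    remote_host:remote_port -- the unencrypted core of `ssh -L`."""
    listener = socket.socket()
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", local_port))
    listener.listen(5)

    def accept_loop():
        while True:
            conn, _ = listener.accept()
            upstream = socket.create_connection((remote_host, remote_port))
            threading.Thread(target=pipe, args=(conn, upstream), daemon=True).start()
            threading.Thread(target=pipe, args=(upstream, conn), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener

# Demo: a stand-in server that uppercases whatever it receives.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

def answer_once():
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024).upper())
    conn.close()

threading.Thread(target=answer_once, daemon=True).start()

tunnel = forward(0, "127.0.0.1", server.getsockname()[1])
client = socket.create_connection(("127.0.0.1", tunnel.getsockname()[1]))
client.sendall(b"select 1")
client.shutdown(socket.SHUT_WR)
reply = b""
while True:
    chunk = client.recv(1024)
    if not chunk:
        break
    reply += chunk
print(reply)
```

With the real ssh -L (or a relay like this), you then point psql or SQL Workbench/J at 127.0.0.1:5439 as if Redshift were local.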



Windows SSH Client
PuTTY is our go-to Windows SSH client. To set up tunneling in PuTTY, expand the SSH section of the menu on the left, then open the Tunnels menu.

• Source port can be anything you’d like—we’ve chosen 5439 in this example.
• For Destination, use your Redshift hostname and port, separating the two with a colon.
• Click Add and save the profile by clicking Open.
• Then connect, using SQL Workbench/J just as above.

And you’re in!

Importing and Exporting Data
Now that you have your cluster, it’s time to load some data! As Redshift is primarily used as a data warehouse, the data usually comes from another system. In this section, we’ll cover imports from primary-serving databases like MySQL and Postgres, as well as more general load strategies and query patterns.

Creating an IAM User
The easiest way to get data into Redshift begins with uploading CSVs to Amazon S3. In the AWS Identity and Access Management (IAM) console, create an account with access to an S3 bucket.

Create a new user and save the Access Key ID and Access Key Secret. These keys are needed to let Redshift retrieve files from your S3 bucket.

Now your user needs an inline policy to define what resources it can access. The JSON blob below is a sample policy we created with the policy generator for S3 resources.

Change my-s3-bucket in the JSON blob to your S3 bucket name and add it as a Custom Policy under Inline Policies.

{"Statement": [
  {
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": [
      "arn:aws:s3:::my-s3-bucket/*"
    ],
    "Sid": "Stmt1234567"
  }
]}

Then make sure your bucket and Redshift cluster are in the same region, or you’ll incur data transfer charges.

Once your bucket is ready, it’s time to create a table!

Creating the Table
Redshift doesn’t support as many column types as most row-store databases. For supported types and their value ranges, take a look at Redshift’s documentation on data types.

With a simple table, the column type translation is pretty straightforward. For JSON, BINARY, and other column types not supported by Redshift, you can store them as NULL to keep the schemas consistent, or varchar(max) if you need the data.

One note on load errors before we continue: any of the “no delimiter” or “missing column” errors can indicate an unescaped newline or control character. Truncated lines that show in the dump file can indicate an unescaped NULL, which Redshift cannot process, even in quotes.
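If you script your AWS setup, a policy of this shape can be generated rather than hand-edited in the console. A sketch—the helper name is ours; the statement matches the sample policy shown in this section:

```python
import json

def s3_copy_policy(bucket, sid="Stmt1234567"):
    """Render an inline policy granting full access to one
    bucket's objects. Tighten "s3:*" to the actions you
    actually need for production."""
    return json.dumps({
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::%s/*" % bucket],
            "Sid": sid,
        }]
    }, indent=2)

print(s3_copy_policy("my-s3-bucket"))
```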



Getting Data In: The COPY Command
To get data into Redshift, start by staging it in Amazon S3. Once your data’s there, you’ll take advantage of the all-purpose COPY command. Here’s an example:

COPY my_schema.my_table
FROM 's3://bucket_name/path/to/my_manifest'
WITH CREDENTIALS
'aws_access_key_id=<my_access_key>;
aws_secret_access_key=<my_secret_key>'
REGION 'us-east-1'
MANIFEST
GZIP
ACCEPTANYDATE
TRUNCATECOLUMNS
ACCEPTINVCHARS

Let’s dissect the above. The first couple of lines are straightforward. Immediately after COPY comes the schema name and table name. Immediately after FROM comes the path to your data’s manifest file in S3.

The WITH CREDENTIALS line contains your AWS account credentials. These credentials must have read access to the S3 bucket in question.

The next three keywords clarify some things about the data:

• REGION specifies the AWS region of your S3 bucket. The default is the AWS region of your Redshift cluster.
• MANIFEST specifies that the path after FROM is to a manifest file.
• GZIP indicates that the data is gzipped.

The final three keywords serve to maximize the success rate of the import:

• ACCEPTANYDATE allows any date format to be parsed in datetime columns.
• TRUNCATECOLUMNS will truncate text columns rather than error if the data is too long.
• ACCEPTINVCHARS will replace invalid Unicode characters with ? rather than erroring.

You can adjust these toggles to taste, but in our experience, failed loads are quite frustrating. We recommend some flexibility on the data rather than endless ETL headaches.

stl_load_errors
As you load tables, you might run into an error or two. The Redshift stl_load_errors table contains most of the recent errors that occurred during a COPY. Since stl_load_errors is a very wide table, we recommend you use \x auto to enable the extended display.

For example, this handy query will show you the two most recent errors:

select starttime, filename, err_reason, line_number,
colname, type, col_length, position, raw_field_value,
raw_line, err_code
from stl_load_errors
order by starttime desc limit 2;

If you don’t see your COPY command in the results, check the Loads tab in the AWS console or try again. The errors are fairly useful, but if you don’t see what you’re looking for, look up the err_code value in Amazon’s documentation.

Dumping From MySQL or Postgres
To get our data out of MySQL, we start by escaping control characters and delimiters with a slash and separating fields with a comma. Note that with these escape characters, MySQL will output NULLs as \N. Use mysqldump to get data out of MySQL:

mysqldump --tab . \
--fields-escaped-by=\\ \
--fields-terminated-by=, \
dbname tablename

This creates a dump file. Upload your file to S3, create the table in Redshift, and load the data in using the following command:

COPY schema.table FROM 's3://path/to/dump.csv'
WITH CREDENTIALS 'aws_access_key_id=<your-access-key-goes-here>;
aws_secret_access_key=<your-secret-key-goes-here>'
NULL as '\\N'
ESCAPE;

Dumping From Postgres
Postgres can dump CSV-like files from psql using \copy, which creates a local file. Just run this command in psql:

\copy table to 'filename' csv header null as '\N'
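Whichever database you dump from, the escaping rules are what make or break the load. Before uploading, here is a sketch of what they amount to for a single row—our own helper, not part of either tool:

```python
def dump_row(values, delimiter=","):
    r"""Render one row the way the flags above ask for: backslash-
    escape backslashes, delimiters, and newlines, and write NULLs
    as \N so the COPY option NULL as '\\N' recognizes them."""
    out = []
    for value in values:
        if value is None:
            out.append(r"\N")
            continue
        text = str(value)
        for ch in ("\\", delimiter, "\n"):  # backslash first!
            text = text.replace(ch, "\\" + ch)
        out.append(text)
    return delimiter.join(out)

print(dump_row(["hello, world", None, 42]))  # → hello\, world,\N,42
```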



Upload your file to S3, create the table in Redshift, and load the data in using the following command:

COPY schema.table FROM 's3://path/to/dump.csv'
WITH CREDENTIALS 'aws_access_key_id=<your-access-key-goes-here>;
aws_secret_access_key=<your-secret-key-goes-here>'
CSV;

Troubleshooting
Here are a few common issues you might run into while loading, and how you can get around them:

• For invalid characters, add ACCEPTINVCHARS to the COPY command. ACCEPTINVCHARS replaces each invalid character with a “?”, so the length is unchanged.
• For out-of-range datetimes, use ACCEPTANYDATE. These are common when exporting from MySQL, since “00/00/00 00:00:00” is valid there, but not in Redshift.
• If your CSV has column headers, use IGNOREHEADER 1.
• For out-of-range numbers, NULL ('\0') characters, or other data that Redshift cannot ingest, you’ll have to fix it at the source.

And now your data’s in Redshift!

Exporting Data

Getting Data Out: The UNLOAD Command
Nearly as common as getting data in is getting data out. Sometimes, the results of hard computations done in Redshift are necessary for serving systems. Other times, a large export is needed for analysis in Excel or other tools.

To get lots of data out, you’ll want the UNLOAD command. Here’s an example:

UNLOAD ('select * from my_table')
TO 's3://bucket_name/path/to/my_filename_prefix'
WITH CREDENTIALS
'aws_access_key_id=<my_access_key>;
aws_secret_access_key=<my_secret_key>'
MANIFEST
GZIP
ALLOWOVERWRITE
ESCAPE
NULL AS '\\N'

Immediately after the UNLOAD keyword, enter the query whose results you want to export, as a string surrounded by parentheses.

Similar to COPY, you must use WITH CREDENTIALS to specify credentials that may write to your S3 bucket.

The next three keywords modify the format of the export itself:

• MANIFEST includes a file listing the dumped files. Redshift will export two files per node (one per slice). A master list can be helpful for reloading via COPY as well as for other programs reading the data.
• GZIP compresses the files, making them much easier to work with.
• ALLOWOVERWRITE proceeds with the export even if the file already exists.

The final two keywords deal with the data:

• ESCAPE puts an escape backslash before newlines, quotes, etc.
• NULL AS uses a special format for null values instead of writing whitespace.

That’s all there is to it! Now go forth and copy large amounts of data.

Helpful Admin Queries
You have a running Redshift cluster. You’ve loaded data and run queries. Now you want to know what’s going on inside the cluster. Which queries are running? Which nodes are getting full? Are some queries taking way too long because they’re not using the correct sort or dist keys?

The AWS Redshift console is helpful, but it doesn’t have all the answers. The system tables in your Redshift cluster have data about actively running and recently completed queries, table sizes, node capacity, and overall performance metrics. In this section, we’ll show you how to query Redshift’s system tables to answer these kinds of questions.

Redshift-Specific System Tables
It’s easy to treat Redshift as a black box—queries go in, answers come out. When something goes wrong, though, you’ll want to open the hood and see what Redshift is actually doing.

To dig into any issues, each Redshift cluster provides virtual system tables you can query. Like Postgres, Redshift has the information_schema and pg_catalog tables, but it also has plenty of Redshift-specific system tables.



All Redshift system tables are prefixed with stl_, stv_, svl_, or svv_:

• The stl_ prefix denotes system table logs. stl_ tables contain logs about operations that happened on the cluster in the past few days.
• The stv_ prefix denotes system table snapshots. stv_ tables contain a snapshot of the current state of the cluster.
• The svl_ prefix denotes system view logs. svl_ views join some number of system tables to provide more descriptive info.
• The svv_ prefix denotes system view snapshots. Like svl_, the svv_ views join some system tables to provide more descriptive info.

Current Cluster Status
One of the most common reasons to log into the Redshift console is to kill a misbehaving query. To find which queries are currently in progress, check the stv_inflight table:

select
userid
, query
, pid
, starttime
, left(text, 50) as text
from stv_inflight

You’ll end up with a table like this:

To kill a query, use the cancel <pid> <msg> command. Be sure to use the process ID—PID in the table above—and not the query ID. You can supply an optional message, which will be returned to the issuer of the query and logged.

Redshift also stores the past few days of queries in svl_qlog if you need to go back further. The stv_recents view has all the recent queries with their status, durations, and PID for currently running queries.

All of these tables only store the first 200 characters of each query. The full query is stored in chunks in stl_querytext. Join this table in by query, and sort by query_id and sequence to get each 200-character chunk in order:

select
query,
starttime,
text,
sequence
from stl_query
join stl_querytext using (query)
order by query, sequence
limit 5;

Deadlocks
If your cluster has a suspiciously long-running update, it may be in a deadlocked transaction. The stv_locks table will indicate any transactions that have locks, along with the process ID of the relevant sessions. This PID can be passed to pg_terminate_backend(pid) to kill the offending session.

To inspect the locks, order them by oldest first:

select
table_id
, last_update
, last_commit
, lock_owner_pid
, lock_status
from stv_locks
order by last_update asc

To terminate the session, run select pg_terminate_backend(lock_owner_pid), using the value from stv_locks.

Connection Issues
Debugging connection issues is never fun. Fortunately, Redshift has a few tables that make up for the lack of a network debugging tool.

The stv_sessions table lists all the current connections, similar to Postgres’s pg_stat_activity. While useful, it doesn’t have the actual connection information for host and port. That can be found in stl_connection_log. This table has a list of all connects, authenticates, and disconnects on your cluster. Joining these tables returns a list of sessions and remote host information:



select distinct
starttime
, process
, user_name
, remotehost
, remoteport
from stv_sessions
left join stl_connection_log
on pid = process
and starttime > recordtime - interval '1 second'
order by starttime desc

Running this query will produce a table like the one below:

Query Performance

STL_ALERT_EVENT_LOG
The stl_alert_event_log table is important for optimizing queries. When the cluster executes your query, it records problems found by the query planner into stl_alert_event_log, along with suggested fixes. Some problems can be fixed by running analyze or vacuum, while others might require rewriting the query or changing your schema.

SVV_TABLE_INFO
svv_table_info returns extended information about the state on disk of your tables. This table can help troubleshoot low-performing tables. While we recommend regular vacuuming and other maintenance, you can also use this table as a guide for when to vacuum.

Here are the column names you’ll see in the svv_table_info table:

• empty shows how many blocks are waiting to be freed by a vacuum.
• unsorted shows the percent of the table that is unsorted. The cluster will need to scan this entire section for every query. You need to vacuum to re-sort and bring this back to 0.
• sortkey1_enc lists the encoding of the first sortkey. This can sometimes affect lookup performance.
• skew_sortkey1 shows the ratio of the size of the first column of the sortkey to the size of the largest non-sortkey column, if a sortkey is defined. You can use this value to evaluate the effectiveness of the sortkey.
• skew_rows shows the ratio of rows from most on a slice to least on a slice. Use it to evaluate distkey.
• max_varchar shows the size of the largest varchars. While varchars compress well, they can force a temporary result which otherwise fits in RAM to be stored on disk, reducing query performance.

For help on changing the sortkeys and distkeys of your giant tables, check out this post.

Copying Tables
If you want to copy or split a table, Redshift supports both create table like and create table as syntax.

create table like copies the structure, compression, distribution, and sortkey. This is great for archiving tables, as it keeps the compression settings.

create table as creates a table and fills it with the given query. You can supply optional sortkeys and distkeys. Note that this won’t compress the table, even if the source tables are compressed. create table as is best for small, temporary tables, since compression helps with performance for an upfront cost.

create table events_201404 as (
select *
from events
where created_at >= '2014-04-01' and created_at < '2014-05-01'
);

create table events_201404 like events;
insert into events_201404 (
select *
from events
where created_at >= '2014-04-01' and created_at < '2014-05-01'
);

To create a compressed table from a query after using create table as, run analyze compression, create a new table with those encodings, and then copy the data into the new table. Alternatively, unload the data somewhere, and load it back with copy.

Managing Query Load
Not all queries need to be treated equally. Some are more important than others, and some require additional resources to be successful. Redshift allows you to configure priority queues for your queries through its Workload Management (WLM) interface. You can separate your data-import jobs from users running analysis queries, and make sure both have the resources they need to complete.



In this section, we’ll show you how to configure and monitor WLM to make the most of your cluster’s resources.

Creating ETL Job Queues
Let’s start by creating some queues for our ETL jobs. For this walkthrough, we’ll assume we have two types: small, quick jobs that run frequently, and a few big, heavy ones that require serious resources.

The first step is to create a group for all our ETL users:

create group etl with user etl_ro, etl_rw

Next, we need to set up WLM queues and assign them based on our group. WLM settings are part of the Parameter Group Settings, which you can find through the Parameter Group link in the sidebar of the Redshift console.

Select a parameter set or create a new one, and click Edit WLM.

Defining Queues
Let’s start with two new queues for our two types of ETL jobs. Queue #1 will be the big job queue; we’ll give it one slot and 20% of memory.

Queue #2 will be for all other ETL jobs; we’ll give it a concurrency of two and 20% memory, meaning each job will get 10% memory.

Here’s what our WLM settings look like:

Notice that queries from users in the etl user group will automatically run in Queue #2.

Assigning Queries to Queues
Queries run in the default queue for their user group. For users in the etl group, that’s Queue #2.

Users can also set their query queue on a per-session basis with set query_group. For example, to run in Queue #1:

set query_group to slow_etl_queue

For our set of queues, we might have all ETL jobs use a user account in the etl user group, and have slow queries set their query_group before running. This way, slow ETLs use the slow queue, and all other ETLs use the standard ETL queue.

Non-ETL queries will run in Queue #3—which is the default—because they have not specified otherwise. Since that queue has the default concurrency of five, we can expect each slot to get one-fifth of the remaining 60% of memory—12% of cluster memory apiece.

Checking Our Work
We’ve set up what we think is the ideal slot structure. Now we want to make sure it works.

First, let’s get the service classes for our queues:

select
id
, condition
, action_service_class
from STV_WLM_CLASSIFICATION_CONFIG
where action_service_class > 4

We get the following results:

The first condition is the special super user queue. If all query slots are taken, the super user can still run queries with set query_group to superuser. This allows one query at a time with a reserved portion of memory.

Now that we have each queue’s service class, we can match it up to the number of query slots and total memory:

select
name
, service_class
, num_query_tasks as slots
, query_working_mem
from STV_WLM_SERVICE_CLASS_CONFIG
where service_class > 4

Here are our results:
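The memory arithmetic behind those numbers is simple enough to double-check by hand: each queue's memory share divided by its slots. A sketch—the queue names and data structure here are our own:

```python
def memory_per_slot(queues):
    """Per-slot memory share for each WLM queue: a queue's
    memory percentage split evenly across its slots."""
    return {name: pct / slots for name, (slots, pct) in queues.items()}

# Queue #1: 1 slot / 20%, Queue #2: 2 slots / 20%,
# default queue: 5 slots / the remaining 60%.
shares = memory_per_slot({
    "queue1_big_etl": (1, 20.0),
    "queue2_etl": (2, 20.0),
    "queue3_default": (5, 60.0),
})
print(shares["queue3_default"])  # → 12.0
```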



The name “Service Class #X” maps to “Queue #X” in the Redshift console, and of course service_class maps to our previous query and other Redshift system tables. service_class one, two, three, and four are reserved for internal use, and service_class five is always the super user queue.

Finally, we’ll look at the queries in the general-purpose queue, service_class eight:

select
userid
, query
, service_class
, slot_count
, total_queue_time
, total_exec_time
, service_class_start_time
from stl_wlm_query
order by service_class_start_time desc
limit 5

That will produce the below results:

Notice that some queries are using multiple slots, as shown in the slot_count column. That’s because those queries have their wlm_slot_count set above one, instructing them to wait for multiple slots to be open and then consume all of them.

Time is logged first for how long the query was in the queue, and then for how long it took to execute. Both of those numbers can be very helpful in debugging slowdowns.

When Not to Use WLM
WLM is great at partitioning your cluster’s resources, but be wary of creating too many queues. If one queue is often full and another empty, you’ll be wasting cluster capacity. In general, the more constraints you give your Redshift cluster, the less flexibility it will have to optimize your workload.

Finally, keep an eye on the total_queue_time field of stl_wlm_query, and consider increasing the number of query slots for queues and service classes with high queue times.

Performance Tips
Redshift is usually very fast. But even with all the power of a large Redshift cluster, queries can still get slow.

This might be because the query is generating so much temporary data that it needs to write its intermediate state to disk. Or perhaps the query shares a common core with several other queries that also need to run, and resources are wasted recomputing that common data for every query.

The most common cause of query underperformance is queries that do not use the tables’ sortkeys and distkeys. These two keys are the closest things to indices in Redshift, so skipping them can cost a lot of time.

In this section, we’ll show you how to find wasteful queries, materialize common data that many queries share, and choose the right sortkeys and distkeys.

Materialized Views
The best way to make your SQL queries run faster is to have them do less work. And a great way to do less work is to query a materialized view that’s already done the heavy lifting.

Materialized views are particularly nice for analytics queries, where many queries do math on the same basic atoms, data changes infrequently (often as part of hourly or nightly ETLs), and ETL jobs provide a convenient home for view creation and maintenance logic.

Redshift doesn’t yet support materialized views out of the box, but with a few extra lines in your import script—or through a tool like Periscope Data—creating and maintaining materialized views as tables is a breeze.

Let’s look at an example. Lifetime daily average revenue per user (ARPU) is a common metric and often takes a long time to compute. We’ll use materialized views to speed this up.

Calculating Lifetime Daily ARPU
This common metric shows the changes in how much money you’re making per user over the lifetime of your product.



Calculating Lifetime Daily ARPU
This common metric shows the changes in how much money you're
making per user over the lifetime of your product. We'll use
materialized views to speed this up.

To calculate this, we'll need a purchases table and a gameplays
table, as well as the lifetime accumulated values for each date.
Here's the SQL for calculating lifetime gameplays:

with
lifetime_gameplays as (
  select
    dates.d
    , count(distinct gameplays.user_id) as count_users
  from (
    select distinct date(created_at) as d
    from gameplays
  ) as dates
  inner join gameplays
    on date(gameplays.created_at) <= dates.d
  group by d
),

The range join in the correlated subquery lets us recalculate the
distinct number of users for each date.

Here's the SQL for lifetime purchases in the same format:

lifetime_purchases as (
  select
    dates.d
    , sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
)

Now that setup is done, we can calculate lifetime daily ARPU:

with
lifetime_gameplays as (...),
lifetime_purchases as (...)
select
  lifetime_gameplays.d as date
  , round(
      lifetime_purchases.sum_purchases /
      lifetime_gameplays.count_users, 2)
    as arpu
from lifetime_purchases
inner join lifetime_gameplays
  on lifetime_purchases.d = lifetime_gameplays.d
order by lifetime_gameplays.d

That's a monster query, and it takes minutes to run on a database
with two billion gameplays and three million purchases! That's way
too slow, especially if we want to quickly slice by dimensions like
gaming platform. Similar lifetime metrics will need to recalculate
the same data over and over again, so it's in our best interest to
speed this query up.

Easy View Materialization on Redshift
Fortunately, we've written our query in a format that makes it
obvious which parts can be extracted into materialized views:
lifetime_gameplays and lifetime_purchases. We'll fake view
materialization in Redshift by creating tables, which can be easily
created from snippets of SQL.

create table lifetime_purchases as (
  select
    dates.d
    , sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
)

Do the same thing for lifetime_gameplays, and calculating lifetime
daily ARPU now takes less than a second to complete!

Redshift is well-suited to this kind of optimization because data on a
cluster usually changes infrequently.
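Keeping these fake materialized views fresh means rebuilding them after each data load. One hedged sketch, using a rename swap so readers never see a missing table; the table and column names follow the example above, and the `_new`/`_old` names are illustrative:

```sql
-- Rebuild lifetime_purchases without a window where it doesn't exist.
begin;
create table lifetime_purchases_new as (
  select
    dates.d
    , sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
);
alter table lifetime_purchases rename to lifetime_purchases_old;
alter table lifetime_purchases_new rename to lifetime_purchases;
drop table lifetime_purchases_old;
commit;
```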



Unless you have a tool like Periscope Data, you'll need to drop and
recreate these tables every time you upload product data to your
Redshift cluster to keep them fresh.

Disk-Based Temporary Tables
Columnar stores like Redshift achieve high speeds by reading only
the columns needed to complete a query. The best speedups are
achieved when these columns and the intermediate results fit in
RAM. But they degrade if the intermediate results exceed the
available RAM and get written to disk.

Your query is likely exceeding the available RAM if it causes spikes
in your disk usage graph.

Redshift keeps detailed statistics on each query execution, available
in the system views svl_query_report and svl_query_summary.
These tables are keyed on query with the ID found in the Redshift
console or by selecting from svl_qlog. The svl_query_summary view
is a summarized version of svl_query_report.

To find recent queries that are hitting disk, run:

select
  query
  , substring
from svl_qlog
join svl_query_summary using(query)
where starttime > date(getdate()) - interval '1 day'
  and is_diskbased = 't';

Digging in
To dig in, we'll open up our Redshift console and check the run time
for our query sample.

The disk space spikes as temporary tables are created and
destroyed, slowing our queries in the process.

To see the execution details, run:

select
  query
  , step
  , rows
  , workmem
  , label
  , is_diskbased
from svl_query_summary
where query = 127387
order by workmem desc

The workmem is the upper bound on memory the query planner
requires for each step, not the amount actually used. If workmem
exceeds the available RAM, the step will use disk for intermediate
results. This is indicated by the is_diskbased column.

The amount of RAM available is based on the number of query slots
allocated to this query. By default, Redshift uses five query slots and
allocates one slot per query. Many of the steps are using disk, so we
need to optimize this query.



Mitigation
There are always two approaches to solving memory constraints:
add more memory, or use less. Adding more memory is expensive
and using less can be difficult—or even impossible. Some queries
will always need to use disk.

For the queries that could fit in RAM, here are some mitigation
strategies:

Update Database Statistics
workmem is only the estimate of space needed by the query, and is
based on the statistics the database collects about each table.

Make sure to run analyze on your tables to keep their internal
statistics up to date. This reduces extraneous padding from outdated
row count estimates.

Making More RAM Available
Adding more memory is done by either adding more nodes to your
cluster or by increasing the wlm_query_slot_count. This can be
done per session using set wlm_query_slot_count, or per queue
using parameter groups.

Increasing the wlm_query_slot_count from one to five gives this
query access to all of the cluster's RAM. The workmem has
increased for all steps, and most are no longer disk-based.

Requiring all of the query slots means that this query needs to wait
until all five slots are available before it can run. This waiting can
eliminate the performance improvements of the additional RAM on a
busy cluster.

More Efficient Queries
This query counts the distinct number of users who have post or visit
events each day. It takes 75 seconds and four of its steps are
disk-based. Instead of adding query slots, let's make the query more
efficient.

select
  date(event_time) as date
  , count(distinct case when event_type = 'Post'
      then user_id else null end) as posts_created
  , count(distinct case when event_type = 'Visit'
      then user_id else null end) as visits
from events
where event_time >= '2014-05-21'
group by 1

The best way to improve the performance of any query is to reduce
the amount of data that's stored and read during its execution. This
query only cares about events with an event_type of 'Post' or 'Visit'.
Everything else will resolve to null and not be counted in the distinct.

Adding a where clause to filter for only these two event types greatly
speeds up this query. It runs in about five seconds and doesn't hit
the disk at all.

Sortkeys and Distkeys
Like a lot of folks in the data community, we've been impressed with
Redshift. But at first, we couldn't figure out why performance was so
variable on seemingly simple queries.

The key is carefully planning each table's sortkey and distribution key.

A table's distkey is the column on which it's distributed to each node.
Rows with the same value in this column are guaranteed to be on
the same node.

A table's sortkey is the column by which it's sorted within each node.

A Naive Table
Our 1B-row activity table is set up this way:

create table activity (
  id integer primary key,
  created_at_date date,
  device varchar(30)
);


A common query is: How much activity was there on each day, split
by device?

select
  created_at_date
  , device
  , count(1)
from activity
group by created_at_date, device
order by created_at_date;

In Periscope Data, this would produce a graph like the one below:

On a cluster with eight dw2.large nodes, this query takes 10 seconds.
To understand why, let's turn to Redshift's CPU Utilization graph.

That is a ton of CPU usage for a simple count query!

The problem is our table has no sortkey and no distkey. This means
Redshift has distributed our rows to each node round-robin as
they're created, and the nodes aren't sorted at all.

As a result, each node must maintain thousands of counters—one
for each date and device pair. Each time it counts a row, it looks up
the right counter and increments it. On top of that, the leader must
aggregate all the counters. This is where all of our CPU time is going.

Smarter Distribution and Sorting
Let's remake our table with a simple, intentional sortkey and distkey:

create table activity (
  id integer primary key,
  created_at_date date sortkey distkey,
  device varchar(30)
);

Now our table will be distributed according to created_at_date, and
each node will be sorted by created_at_date. The same query runs
on this table in eight seconds—a solid 20% improvement.

Because each node is sorted by created_at_date, it only needs one
counter per device. As soon as the node is done iterating over each
date, the values in each device counter are written out to the result,
because the node knows it will never see that date again.

And because dates are unique to each node, the leader doesn't have
to do any math over the results. It can just concatenate them and
send them back.

Our approach is validated by lower CPU usage across the board.

But what if there were a way to require only one counter? Fortunately,
Redshift allows multi-key sorting.

create table activity (
  id integer primary key,
  created_at_date date distkey,
  device varchar(30)
)
sortkey (created_at_date, device);

Our query runs on this table in five seconds—a 38% improvement
over the previous table, and a 2x improvement over the naive query.

Once again, the CPU chart will show us how much work is required.

Only one counter is required. As soon as a node is done with a date
and device pair it can be written to the result, because that pair will
never be seen again.

Of course, choosing a table's sortkey for one query might seem like
overkill. But choosing intentionally always beats the naive case. And
if you have a few dimensions you use a lot in your analysis, it's worth
including them.
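To confirm which sortkey and distkey a table actually ended up with, you can query the pg_table_def catalog view; this is a standard Redshift view, where distkey is a boolean flag and sortkey gives each column's position in the sort key:

```sql
-- Show each column's encoding, distkey flag, and sortkey position.
select "column", type, encoding, distkey, sortkey
from pg_table_def
where tablename = 'activity';
```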



Maintenance
You'll occasionally need to run some maintenance tasks to keep your
Redshift cluster running at peak performance.

If you're familiar with other kinds of databases, you already know
about vacuuming. You'll need to vacuum your Redshift tables if their
data has changed, or if you've added a lot of new rows.

And as the data of your tables changes, you'll want to recheck the
compression settings. If a table isn't using the best possible
compression, it'll take up too much space and slow queries down.

In this section, we'll show you when—and when not—to vacuum,
how to recheck compression settings, and how to keep an eye on
disk usage so you know when to upgrade to a larger Redshift cluster.

When Not to Vacuum
Most guidance around vacuuming says to do it as often as necessary.
When in doubt, we recommend nightly. But vacuum operations can
be very expensive on the cluster and greatly reduce query
performance. You can skip vacuuming tables in certain situations,
like when:

Data Is Loaded in Sortkey Order
When new rows are added to a Redshift table, they're appended to
the end of the table in an unsorted region. For most tables, this
means you have a bunch of rows at the end of the table that need to
be merged into the sorted region of the table by a vacuum.

You don't need to vacuum when appending rows in sortkey order; for
instance, if you're adding new rows to an events table that's sorted
by the event's time, the rows are already sorted when they're added
in. In this case, you don't need to re-sort this table with a vacuum
because it's never unsorted.

A Lot of Data Is Unsorted
If it's been a long time since you vacuumed the table, or if you've
appended a ton of unsorted data, it can be faster to copy the table
than to vacuum it.

You can recreate the table with all the same columns, compression
encodings, distkeys and sortkeys with create table:

create table events_copy (like events);
insert into events_copy (select * from events);
drop table events;
alter table events_copy rename to events;

A Lot of Data Was Deleted
Unlike Postgres, the default vacuum operation in Redshift is vacuum
full. This operation reclaims dead rows and re-sorts the table.

If you've recently deleted a lot of rows from a table, you might just
want to get the space back. You can use a delete-only vacuum to
compact the table without spending the time to re-sort the
remaining rows:

vacuum delete only events

You can see how many rows were deleted or re-sorted from the
most recent vacuums by querying svv_vacuum_summary:

select * from svv_vacuum_summary
where table_name = 'events'

And it's always a good idea to analyze a table after a major change to
its contents:

analyze events

Rechecking Compression Settings
When you copy data into an empty table, Redshift chooses the best
compression encodings for the loaded data. As data is added and
deleted from that table, the best compression encoding for any
column might change.

What used to make sense as a bytedict might now be better off as a
delta encoding if the number of unique values in the column has
grown substantially.

To see the current compression encodings for a table, query
pg_table_def:

select "column", type, encoding
from pg_table_def
where tablename = 'events'

And to see what Redshift recommends for the current data in the
table, run analyze compression:

analyze compression events

Then simply compare the results to see if any changes are
recommended.
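A hedged way to decide which tables actually need that nightly vacuum is to check how unsorted each one is. svv_table_info is a standard Redshift system view whose unsorted column reports the percent of unsorted rows; the ordering below is a sketch, and any threshold you act on is your own judgment call:

```sql
-- Percent of unsorted rows per table; high values suggest a vacuum is due.
select "table", unsorted, tbl_rows
from svv_table_info
order by unsorted desc nulls last;
```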



Redshift doesn't currently have a way to alter the compression
encoding of a column. You can add a new column to the table with
the new encoding, copy over the data, and then drop the old column:

alter table events add column device_id_new integer encode delta;
update events set device_id_new = device_id;
alter table events drop column device_id;
alter table events rename column device_id_new to device_id;

Monitoring Disk Space
If your cluster gets too full, queries will start to fail because there
won't be enough space to create the temp tables used during query
execution. Vacuums can also fail if there isn't enough free space to
store the intermediate data while it's getting re-sorted.

To keep an eye on how much space is available in your cluster via
SQL, query stv_partitions:

select sum(used)::float / sum(capacity) as pct_full
from stv_partitions

And to see individual table sizes:

select t.name, count(tbl) / 1000.0 as gb
from (
  select distinct datname id, name
  from stv_tbl_perm
  join pg_database on pg_database.oid = db_id
) t
join stv_blocklist on tbl=t.id
group by t.name order by gb desc

Now you can either drop unnecessary tables or resize your cluster to
have more capacity!

Conclusion
Congratulations on your first Redshift setup! This guide has covered
the basics a data analyst will need to get started. It all begins with
configuration and ensuring that your Redshift cluster inherits the
security measures to keep your data safe. Then you'll need to
manage getting data into—or out of—Redshift. While native
measures offer the most flexibility, ETL tools can serve as a
valuable supplement to provide greater performance. Once your
data is loaded, you'll need to focus on managing the cluster to
uphold ultimate availability and performance.

We hope this guide helps you exert more control over your data
workflows—turning raw data into beautiful and insightful charts
at a rapid pace. Want to take your data team to the next level?
Sign up for a free trial of Periscope Data to do even more with
Redshift.

About Periscope Data


Periscope Data is the analytics system of record for professional data teams. The platform enables data leaders, analysts and engineers to
unlock the transformative power of data using SQL and surface actionable business insights in seconds—not hours or days. Periscope Data
gives them control over the full data analysis lifecycle, from data ingestion, storage and management through analysis, visualization and
sharing. The company serves nearly 900 customers including Adobe, New Relic, EY, ZipRecruiter, Tinder and Flexport, with analysts spending
the equivalent of two business days per week in the platform to support data-driven decision-making. Periscope Data is headquartered in San
Francisco, CA. For more information, visit www.periscopedata.com.

