Periscope Data Presents: The Analyst's Guide To Amazon Redshift
Columnar Storage
Many databases store data by row, which requires you to read a whole table to sum a column.

Redshift stores its data by column. Since columns are stored separately, Redshift can ignore columns that aren't referenced by the current query. The less data you need to read from disk, the faster your query will run.

Distributed Architecture
Redshift stores each table's data in thousands of chunks called blocks. Each block can be read in parallel. This allows the Redshift cluster to use all of its resources on a single query.

When reading from disk, a Redshift cluster can achieve much higher input/output operations per second (IOPS), since each node reads from a different disk and the IOPS sums across the cluster.

In this guide, we'll walk you through the basics of Redshift, and offer step-by-step instructions on how to get started. Let's kick things off with setting up a cluster.

Getting Started With Redshift: Cluster Configuration
The first step to using Redshift is to set up your cluster. The most important choices you'll make during cluster setup are the types of nodes to use—we recommend Dense Compute—and the network security settings for your cluster.

Your security settings will determine how to connect to your cluster once it's running. If you choose to make Redshift publicly accessible, you'll need to whitelist the IPs in your cluster's network security group. If your cluster has a private IP in a VPC, you'll need to set up and connect through a bastion host.

Number of Nodes
Now you'll need to figure out how many nodes to use. This depends somewhat on your dataset, but for single query performance, the more the merrier.

The size of your data will determine the smallest cluster you can have. Compute nodes only come with 160GB drives. Even if your row count is in the low billions, you may still require 10+ nodes.

Network Setup
The last step is network setup. Clusters in US East (North Virginia) do not require a VPC, but the rest do. For any production usage, we suggest using a VPC, as you'll get better network connectivity to your EC2 instances.

A default VPC will be created if one doesn't exist. If you want to access Redshift from outside of AWS, then add a public IP by setting Publicly Accessible to true. Whether you want a public IP on your cluster is up to you. We'll cover both public and private IPs in this guide.

In either case, take note of the VPC Security group. You'll need to allow access to the cluster through it later.

EC2 Classic Setup
We'll start with the simplest cluster setup possible—a cluster in Virginia not in any VPC. This kind of setup is best used for prototyping. Once the cluster boots, the Configuration tab in the AWS Redshift console will show you the endpoint address.
Now connect directly to your cluster:

And we're in!

• Become the user to install your public key. -s sets a shell that quits, so the user can forward ports, but not run commands.
• Paste your public key, press Enter, then press Ctrl-d. Alternatively, you can copy the file there.

no-pty,permitopen="foo.us-east1.amazonaws.com:5439" ssh-rsa AAAAB3NzaC1y...Rdo/R user@clientbox

$ ssh bastion.us-east1.amazonaws.com \
  -L5439:foo.us-east-1.amazonaws.com:5439 -nNT

The -L option forwards traffic received on the local port to the server. In this case, the server forwards that connection to the database.

• After the connection starts working, connect using localhost as the hostname and 5439 as the port.

For more details on port forwarding—and cool tricks like the reverse tunnel—check the Ubuntu wiki.

Windows SSH Server
Windows doesn't come with an SSH server pre-installed. We recommend using freeSSHd—it's free and easier to set up than OpenSSHd.

• In the Server Status tab, make sure the SSH server is running. In the Users tab, create a user. Set Authorization to Public Key and make sure to allow Tunnelling.
• In the Tunneling tab, enable Local Port Forwarding. In the Authentication tab, set Public key authentication to Required, then open the public key folder.
• Copy your public key to a file with the same name as the user. The name has to match exactly, so take out any file extension.
• Make sure the public key is in the correct folder and has the correct name. You may also need to restrict it to administrator only. If your changes don't seem to be taking effect, make sure you're running as an administrator.
Dumping From MySQL
MySQL can dump CSV-like files with mysqldump:

mysqldump --tab . --fields-escaped-by=\\ --fields-terminated-by=, dbname tablename

Dumping From Postgres
Postgres can dump CSV-like files from psql using \copy, which creates a local file. Just run this command in psql:
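The \copy command itself isn't reproduced in this excerpt; something along these lines (the events table name and output file are placeholders) writes a comma-separated file:

\copy (select * from events) to 'events.csv' with csv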
Then make sure your bucket and Redshift cluster are in the same region, or you'll incur data transfer charges.

Create a new user and save the Access Key ID and Access Key Secret. These keys are needed to let Redshift retrieve files from your S3 bucket.
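The guide's own COPY statement isn't shown in this excerpt; here is a sketch of the general shape, with a placeholder table, bucket path, and three of the common data options:

copy events
from 's3://bucket_name/path/to/dump_'
with credentials
'aws_access_key_id=<my_access_key>;aws_secret_access_key=<my_secret_key>'
csv
emptyasnull
blanksasnull
truncatecolumns;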
The WITH CREDENTIALS line contains your AWS account credentials. These credentials must have read access to the S3 bucket in question.

The next three keywords clarify some things about the data. You can adjust these toggles to taste, but in our experience, failed loads are quite frustrating. We recommend some flexibility on the data rather than endless ETL headaches.
stl_load_errors
As you load tables, you might run into an error or two. The Redshift
stl_load_errors table contains most of the recent errors that
occurred during a COPY. Since stl_load_errors is a very wide table,
we recommend you use \x auto to enable the extended display.
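For example, a quick look at the most recent load errors (the column list here is just a useful subset):

select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;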
To get lots of data out, you'll want the UNLOAD command. Here's an example:

UNLOAD ('select * from my_table')
TO 's3://bucket_name/path/to/my_filename_prefix'
WITH CREDENTIALS
'aws_access_key_id=<my_access_key>;
aws_secret_access_key=<my_secret_key>'
MANIFEST
GZIP
ALLOWOVERWRITE
ESCAPE
NULL AS '\\N'

Redshift-Specific System Tables
It's easy to treat Redshift as a black box—queries go in, answers come out. When something goes wrong, though, you'll want to open the hood and see what Redshift is actually doing.

To dig into any issues, each Redshift cluster provides virtual system tables you can query. Like Postgres, Redshift has the information_schema and pg_catalog tables, but it also has plenty of Redshift-specific system tables.
Connection Issues
Debugging connection issues is never fun. Fortunately, Redshift has a few tables that make up for the lack of a network debugging tool.

The stv_sessions table lists all the current connections, similar to Postgres's pg_stat_activity. While useful, it doesn't have the actual connection information for host and port. That can be found in stl_connection_log. This table has a list of all connects, authenticates, and disconnects on your cluster. Joining these tables returns a list of sessions and remote host information.
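A sketch of that join, keeping only sessions that authenticated:

select distinct
  s.process as pid,
  s.user_name,
  s.db_name,
  c.remotehost,
  c.remoteport
from stv_sessions s
join stl_connection_log c
  on c.pid = s.process
 and c.event = 'authenticated';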
The stv_recents view has all the recent queries with their status, durations, and PID for currently running queries. Redshift also stores the past few days of queries in svl_qlog if you need to go back further.

To kill a query, use the cancel <pid> <msg> command. Be sure to use the process ID—the PID shown in stv_recents—and not the query ID. You can supply an optional message, which will be returned to the issuer of the query and logged.
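A minimal sketch, with a made-up pid and message: find the query in stv_recents, then cancel it.

select pid, duration, trim(query) as query
from stv_recents
where status = 'Running';

cancel 8219 'Query cancelled by admin';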
All of these tables only store the first 200 characters of each query. The full query is stored in chunks in stl_querytext. Join this table in by query, and sort by query and sequence to get each 200-character chunk in order.
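For instance, to stitch the chunks of one query back together (the query ID here is a placeholder):

select query, listagg(text) within group (order by sequence) as full_sql
from stl_querytext
where query = 12345
group by query;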
Query Performance

STL_ALERT_EVENT_LOG
The stl_alert_event_log table is important for optimizing queries. When the cluster executes your query, it records problems found by the query planner into stl_alert_event_log along with suggested fixes. Some problems can be fixed by running analyze or vacuum, while others might require rewriting the query or changing your schema.
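A useful first pass is to group the planner's complaints and suggested fixes to see which ones come up most, along these lines:

select trim(event) as event, trim(solution) as solution, count(*) as occurrences
from stl_alert_event_log
group by 1, 2
order by occurrences desc;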
SVV_TABLE_INFO
svv_table_info returns extended information about the state on disk of your tables. This table can help troubleshoot low-performing tables. While we recommend regular vacuuming and other maintenance, you can also use this table as a guide for when to vacuum.

Here are the column names you'll see in the svv_table_info table:

• empty shows how many blocks are waiting to be freed by a vacuum
• unsorted shows the percent of the table that is unsorted. The cluster will need to scan this entire section for every query. You need to vacuum to re-sort and bring this back to 0.
• sortkey1_enc lists the encoding of the first sortkey. This can sometimes affect lookup performance.
• skew_sortkey1 shows the ratio of the size of the first column of the sortkey to the size of the largest non-sortkey column, if a sortkey is defined. You can use this value to evaluate the effectiveness of the sortkey.
• skew_rows shows the ratio of rows from most on a slice to least on a slice. Use it to evaluate distkey.
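For example, a quick health check sorted by how unsorted each table is:

select "table", tbl_rows, unsorted, empty, sortkey1_enc, skew_sortkey1, skew_rows
from svv_table_info
order by unsorted desc nulls last;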
create table as creates a table and fills it with the given query. You can supply optional sortkeys and distkeys. Note that this won't compress the table, even if the source tables are compressed. create table as is best for small, temporary tables, since compression helps with performance for an upfront cost.

create table events_201404 as (
  select *
  from events
  where created_at >= '2014-04-01' and created_at < '2014-05-01'
);

To keep the source table's compression encodings, distkey, and sortkey, create the table with like and fill it with an insert instead:

create table events_201404 (like events);
insert into events_201404 (
  select *
  from events
  where created_at >= '2014-04-01' and created_at < '2014-05-01'
);

To create a compressed table from a query after using create table as, run analyze compression, create a new table with those encodings, and then copy the data into a new table. Alternatively, unload the data somewhere, and load it back with copy.
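A sketch of that flow for the events_201404 example above; the column list and encodings are placeholders, so substitute whatever analyze compression actually reports:

analyze compression events_201404;

-- hypothetical columns and encodings; use the ones analyze compression suggests
create table events_201404_compressed (
  id integer encode delta,
  created_at timestamp encode delta32k,
  device varchar(30) encode lzo
);

insert into events_201404_compressed (select * from events_201404);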
Managing Query Load
Not all queries need to be treated equally. Some are more important than others, and some require additional resources to be successful. Redshift allows you to configure priority queues for your queries through its Workload Management (WLM) interface. You can separate your data-import jobs from users running analysis queries, and make sure both have the resources they need to complete.

Select a parameter set or create a new one, and click Edit WLM.
Defining Queues
Let’s start with two new queues for our two types of ETL jobs. Queue
#1 will be the big job queue; we’ll give it one slot and 20% of memory.
Queue #2 will be for all other ETL jobs; we’ll give it concurrency of
two and 20% memory, meaning each job will get 10% memory.
Here’s what our WLM settings look like:
The first condition is the special super user queue. If all query slots
are taken, the super user can still run queries with set query_group
to superuser. This allows one query at a time with a reserved portion
of memory.
For our set of queues, we might have all ETL jobs use a user account
in the etl user group, and have slow queries set their query_group
before running. This way, slow ETLs use the slow queue, and all
other ETLs use the standard ETL queue.
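In practice that means wrapping the slow statements in a query group, along these lines (the group name 'slow' stands in for whatever you configured in WLM):

set query_group to 'slow';
-- run the slow ETL statements here
reset query_group;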
Redshift doesn’t yet support materialized views out of the box, but
with a few extra lines in your import script—or through a tool like
Periscope Data—creating and maintaining materialized views as
tables is a breeze.
Notice that some queries are using multiple slots, as shown in the
slot_count column. That’s because those queries have their wlm_ Let’s look at an example. Lifetime daily average revenue per user
slot_count set above one, instructing them to wait for multiple slots (ARPU ) is a common metric and often takes a long time to compute.
to be open and then consume all of them. We’ll use materialized views to speed this up.
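Requesting extra slots is a per-session setting; a sketch, with an example slot count and an arbitrary vacuum as the heavy statement:

set wlm_slot_count to 3;
vacuum events;
set wlm_slot_count to 1;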
Time is logged first for how long the query will be in the queue, and then the time it will take to execute. Both of those numbers can be very helpful in debugging slowdowns.

Redshift doesn't yet support materialized views out of the box, but with a few extra lines in your import script—or through a tool like Periscope Data—creating and maintaining materialized views as tables is a breeze.

Let's look at an example. Lifetime daily average revenue per user (ARPU) is a common metric and often takes a long time to compute. We'll use materialized views to speed this up.

Calculating Lifetime Daily ARPU
This common metric shows the changes in how much money you're making per user over the lifetime of your product.
lifetime_purchases as (
  select
    dates.d
    , sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
)

Easy View Materialization on Redshift
Fortunately, we've written our query in a format that makes it obvious which parts can be extracted into materialized views: lifetime_gameplays and lifetime_purchases. We'll fake view materialization in Redshift by creating tables, which can be easily created from snippets of SQL.

create table lifetime_purchases as (
  select
    dates.d
    , sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
)
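Keeping the table fresh is then a matter of rebuilding it from your import script. A minimal sketch using a staging table and a swap (the _staging name is ours, not the guide's):

drop table if exists lifetime_purchases_staging;

create table lifetime_purchases_staging as (
  select
    dates.d
    , sum(price) as sum_purchases
  from (
    select distinct date(created_at) as d
    from purchases
  ) as dates
  inner join purchases
    on date(purchases.created_at) <= dates.d
  group by d
);

drop table if exists lifetime_purchases;
alter table lifetime_purchases_staging rename to lifetime_purchases;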
Requiring all of the query slots means that this query needs to wait until all five slots are available before it can run. This waiting can eliminate the performance improvements of the additional RAM on a busy cluster.

The key is carefully planning each table's sortkey and distribution key. A table's sortkey is the column by which it's sorted within each node.

But what if there were a way to require only one counter? Fortunately, Redshift allows multi-key sorting.

A Naive Table
Our 1B-row activity table is set up this way:

create table activity (
  id integer primary key,
  created_at_date date distkey,
  device varchar(30)
)
sortkey (created_at_date, device);

On a cluster with eight dw2.large nodes, this query takes 10 seconds. To understand why, let's turn to Redshift's CPU Utilization graph.
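The query being timed isn't shown in this excerpt, but a per-day distinct count along these lines is the kind of workload a (created_at_date, device) sortkey is meant to help with:

select created_at_date, count(distinct device) as devices
from activity
group by created_at_date
order by created_at_date;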
If you’re familiar with other kinds of databases, you already know If you’ve recently deleted a lot of rows from a table, you might just
about vacuuming. You’ll need to vacuum your Redshift tables if their want to get the space back. You can use a delete-only vacuum to
data has changed, or if you’ve added a lot of new rows. compact the table without spending the time to re-sort the
remaining rows.
And as the data of your tables changes, you’ll want to recheck the
compression settings. If a table isn’t using the best possible vacuum delete only events
compression, it’ll take up too much space and slow queries down.
You can see how many rows were deleted or re-sorted from the
In this section, we’ll show you when—and when not—to vacuum, most recent vacuums by querying svv_vacuum_summary.
how to recheck compression settings, and how to keep an eye on
disk usage so you know when to upgrade to a larger Redshift cluster. select * from svv_vacuum_summary
where table_name = 'events'
You can recreate the table with all the same columns, compression
encodings, distkeys and sortkeys with create table:
And to see what Redshift recommends for the current data in the
table, run analyze compression:
create table events_copy (like events);
insert into events_copy (select * from events);
analyze compression events
drop table events;
alter table events_copy rename to events