Coffing Data Warehousing Software

Amazon Redshift Tutorial: Learning Through Puzzles and Examples (Part Three)

This post is the third part of a three-part series on an Amazon Redshift Data Warehouse.

Amazon Web Services has made tuning your Redshift cluster easy.  When you use the CREATE TABLE command to create a permanent table or even a temp table, you control the table's distkey and sortkey and each column's data type.  Redshift uses two strategies for tuning.

The first strategy comes automatically: Redshift captures the minimum and maximum value for every column (except character data) for every columnar block of data.  This allows blocks to be skipped when a query's WHERE clause falls outside a block's min/max metadata range.

The second strategy is that Redshift allows the DBA team and users to sort tables with a single-column sortkey, a compound sortkey, or an interleaved sortkey.  If a table is sorted strategically, queries run faster because the min/max metadata stored for each block lets Redshift skip the majority of blocks that fall outside the query's range. If you skip half the blocks for a query, you have roughly halved the I/O and close to doubled the query speed! This is especially effective when using dates!  If you feel the need for speed, then sort the table with advice I heed!


Quiz – How Many Blocks in Total are Read into Memory?

Amazon Redshift slices build metadata for each column block for data types that involve numbers and dates.  The metadata consists merely of the min/max values in the column block, which allows the optimizer to skip blocks whose min/max range falls outside the query's WHERE clause.  Below, you can see the query and the slices. How many blocks in total will be placed into memory?

Answer – How Many Blocks in Total are Read into Memory?

The WHERE clause is looking for DeptNo = 56.  Amazon Redshift uses the metadata to check whether a DeptNo block could possibly hold the value 56.  This is how the metadata is used and why it is captured. Performance tuning is all about eliminating blocks on disk from being read into memory: the fewer blocks moved into memory, the faster the query. Below, the metadata for DeptNo shows that only one DeptNo block has a min/max range containing 56.  Therefore, only slice 2 is called upon, and it moves the DeptNo, Name, and Soc column blocks into memory.
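The quiz query is of this general shape (the table name Employee_Table is an assumption for illustration; DeptNo, Name, and Soc come from the example):

```sql
-- Redshift checks each block's min/max metadata for DeptNo.
-- Only blocks whose range could contain 56 are read from disk.
SELECT Name,
       Soc,
       DeptNo
FROM   Employee_Table   -- table name assumed for illustration
WHERE  DeptNo = 56;
```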
Improving Performance By Defining a Sort Key

There are three basic reasons to use the sortkey keyword when creating a table:

1) If recent data is queried most frequently, specify the timestamp or date column as the leading column for the sort key.
2) If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
3) If you frequently join a (dimension) table, specify the join column as the sort key.

Below, you can see we have made our sortkey the Order_Date column. Look at how the data is sorted!
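A table definition along these lines produces the sort shown (Order_Date comes from the example; the table name and other columns are assumptions for illustration):

```sql
CREATE TABLE order_table (          -- table name assumed for illustration
    order_no    INTEGER,
    customer_no INTEGER,
    order_total DECIMAL(10,2),
    order_date  DATE
)
SORTKEY (order_date);               -- rows are stored sorted by Order_Date
```

Because Order_Date is the sort key, each block's min/max metadata becomes highly selective for date filters, so range queries on Order_Date skip most of the table.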

Sort Keys Help Group By, Order By, and Window Functions

When data is sorted on a strategic column, it improves aggregations (GROUP BY and ORDER BY operations), window functions (PARTITION BY and ORDER BY operations), and even compression. But as new rows are incrementally loaded, they are sorted among themselves yet reside temporarily in a separate unsorted region on disk.  In order to maintain a fully sorted table, you need to run the VACUUM command at regular intervals. You will also need to run ANALYZE so the leader node has up-to-date statistics about the table and can build a better plan when the table is queried.
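The maintenance commands themselves are simple; assuming a table named order_table:

```sql
VACUUM FULL order_table;   -- re-sorts the rows and reclaims deleted space
ANALYZE order_table;       -- refreshes the statistics the planner relies on
```

Scheduling these during off-peak hours keeps the table fully sorted without competing with user queries.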

Creating Three Tables with Different Sort Key Strategies

The three types of sort keys are single-column, compound, and interleaved.  A single-column sortkey sorts on one column and performs like an ORDER BY on that column.  A compound sortkey is a multi-column sort key that also sorts like a normal ORDER BY statement, major column first and then minor columns. An interleaved sortkey is also multi-column, but it gives equal power to all the keys.  Read on.

A Distribution Key and a Single-Sortkey

Pretend that the table below has grown big enough (not really, but for example's sake) that it takes up four blocks on a slice.  If the distkey is Prod_ID and the sortkey is Sale_Date, the blocks might look like the example below.
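The DDL for this layout looks something like the following (the table name and the daily_sales column are assumptions for illustration; Prod_ID and Sale_Date come from the example):

```sql
CREATE TABLE sales (            -- table name assumed for illustration
    prod_id     INTEGER,
    sale_date   DATE,
    daily_sales DECIMAL(10,2)
)
DISTKEY (prod_id)               -- distributes rows across slices by Prod_ID
SORTKEY (sale_date);            -- sorts each slice's rows by Sale_Date
```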

Quiz – How Many Blocks Must Be Read into Memory?

How many blocks will be read into memory when the query (in yellow) is run on the example below?

Answer – How Many Blocks Must Be Read into Memory?

All four Prod_ID blocks must be read into memory to satisfy the query.

A Distribution Key and a Compound-Sortkey

Pretend that the table below has grown big enough (not really, but for example's sake) that it takes up four blocks on a slice.  If the distkey is Prod_ID and the compound sortkey is Prod_ID and Sale_Date, the blocks will look like the example.
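The compound version of the DDL looks something like this (the table name and the daily_sales column are assumptions for illustration):

```sql
CREATE TABLE sales_compound (   -- table name assumed for illustration
    prod_id     INTEGER,
    sale_date   DATE,
    daily_sales DECIMAL(10,2)
)
DISTKEY (prod_id)
COMPOUND SORTKEY (prod_id, sale_date);  -- Prod_ID is the major sort,
                                        -- Sale_Date the minor sort
```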

Quiz – How Many Blocks are Moved into Memory?

How many blocks move into memory for the queries below?


Answer – How Many Blocks are Moved into Memory?

Only one block moves into memory for the first query, but four blocks are needed for the second.
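The two quiz queries are of this general shape (the table name and the literal values are placeholders):

```sql
-- Query 1: filters on the leading (major) column of the compound sortkey,
-- so the min/max metadata eliminates all but one block.
SELECT * FROM sales_compound WHERE prod_id = 1000;

-- Query 2: filters only on the minor column. Sale_Date values repeat in
-- every Prod_ID group, so every block's min/max range can contain the
-- date and all four blocks must be read.
SELECT * FROM sales_compound WHERE sale_date = '2021-01-15';
```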


A Distribution Key and an Interleaved Sortkey

Pretend that the table below has grown big enough (not really, but for example's sake) that it takes up four blocks on a slice.  If the distkey is Prod_ID and the interleaved sortkey is Prod_ID and Sale_Date, the blocks will look similar to the example below. A normal (compound) sort key has a major and a minor sort, and the minor sort is only applied when there are duplicate values in the major sort.  The major sort has all the power, and the minor sort is merely a supporting actor. An interleaved sort gives equal power to all the sort keys.
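The interleaved version of the DDL looks something like this (the table name and the daily_sales column are assumptions for illustration):

```sql
CREATE TABLE sales_interleaved (    -- table name assumed for illustration
    prod_id     INTEGER,
    sale_date   DATE,
    daily_sales DECIMAL(10,2)
)
DISTKEY (prod_id)
INTERLEAVED SORTKEY (prod_id, sale_date);  -- both columns get equal
                                           -- weight in the sort order
```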

Quiz – How Many Blocks are Moved into Memory?

Can you guess how many blocks will need to be moved into memory to satisfy each query?  Can you then see the purpose of an interleaved sortkey?

Answer – How Many Blocks are Moved into Memory?

Both queries were satisfied by moving only two blocks into memory.  You utilize an interleaved sortkey when your users query multiple columns in the WHERE clause about equally often.  If you have 100 queries with a WHERE clause on the table, and 50 of them reference Prod_ID in the WHERE clause while the other 50 reference Sale_Date, you should consider an interleaved sortkey on both columns.
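With an interleaved sortkey, single-column filters of either shape can use the min/max metadata to skip roughly the same share of blocks (the table name and literal values are placeholders):

```sql
-- Each of these queries, on its own, skips about half the blocks,
-- because neither column dominates the interleaved sort order.
SELECT * FROM sales_interleaved WHERE prod_id = 1000;
SELECT * FROM sales_interleaved WHERE sale_date = '2021-01-15';
```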

The Greatest Technique for Amazon Redshift is Nexus and the NexusCore Server

The Nexus and NexusCore Server combination provides the greatest invention for moving data between cloud and on-premises systems.  The Nexus client is every user's power tool: users can move data to and from Amazon Redshift and all other systems, on the cloud or on-premises.  The NexusCore Server runs the data movement jobs for the user, as the user. This allows companies to place as many NexusCore Servers as they need strategically within their company firewalls or directly on the Amazon or Azure clouds.  Users can also use the Nexus Super Join Builder to drag-and-drop tables from all cloud and on-premises systems, and Nexus will write the SQL, conversions, and data movement necessary to bring back an answer set. The Nexus on the desktop builds it all automatically, and then the NexusCore Server runs the query from the server.  This design gives the user on the desktop incredible power, and the NexusCore Server runs the job for the user as if the user were sitting on top of the server!


Voila! You now understand Amazon Redshift. If you’d like me (Tera-Tom) to come teach at your organization, please get in touch with the Coffing Data Warehousing team here.