Amazon Redshift Spectrum for Analyzing Data in Amazon S3
Over time, enterprises accumulate a lot of data that gets buried beneath the pile of ‘hot’ data. This ‘dark data’ can hold valuable business insights, which means analysts need solutions that give them access to petabytes of it.
Today, there are three primary ways to access and analyze data in S3:
- Amazon Elastic MapReduce (EMR): EMR uses Hadoop-style queries to access and process large data sets in S3.
- Amazon Athena: Athena offers a console to query S3 data with standard SQL and no infrastructure to manage. Athena also has an API.
- Amazon Redshift: You can load data from S3 into an Amazon Redshift cluster for analysis.
So why not use one of these existing options for analyzing data in S3? For example, companies already use Amazon Redshift to analyze their “hot” data. Why not simply load the cold data from S3 into Redshift as well?
There are two reasons why doing so becomes infeasible, especially as your data volume grows:
- Effort: Loading data into Amazon Redshift involves extract, transform and load (ETL) steps. ETL is necessary to convert and structure data for analysis. Amazon estimates that figuring out the right ETL consumes 70% of an analytics project.
- Cost: You may not even know what data to extract until you have analyzed it a bit. Uploading lots of cold S3 data for analysis requires growing your clusters. That translates to extra cost, since Redshift pricing is based on the size of your cluster. Meanwhile, you continue to pay S3 storage charges for retaining your cold data.
Amazon Redshift Spectrum lets you query data stored in Amazon S3 without first loading it into Amazon Redshift. For nomenclature purposes, we’ll use “Redshift” for “Amazon Redshift”, and “Spectrum” for “Amazon Redshift Spectrum”.
Redshift Spectrum offers the best of both worlds. With Spectrum, you can:
- Continue using your analytics applications with the same queries you’ve written for Redshift.
- Leave cold data in S3, and query it via Amazon Redshift, without ETL processing. You can even join data from your data lake with data in Redshift, using a single query.
- Decouple processing from storage. Since there’s no need to increase cluster size, you can save on Redshift storage.
- Pay only when you run queries against S3 data. Spectrum queries cost $5 per terabyte of data scanned, so a query that scans 10 GB costs about five cents.
Spectrum is the “glue” that gives Redshift an interface to data in S3. Redshift is the access layer for your business applications; Spectrum is the query processing layer for data accessed from S3. The following picture illustrates the relationship between these services.
A Closer Look: How Does Amazon Redshift Spectrum Work?
From a deployment perspective, Spectrum is “under the hood”: a fleet of AWS-managed nodes, independent of your Amazon Redshift cluster and available to any of your Redshift clusters that are Spectrum-enabled. Redshift pushes the compute-intensive parts of a query down to this Spectrum layer.
There are three key concepts you need to understand to run queries with Redshift Spectrum:
- External data catalog
- External schemas
- External tables
The external data catalog contains the schema definitions for the data you wish to access in S3. It’s a central metadata repository for your data assets. Potential options for your data catalog are:
- The AWS Glue Data Catalog
- The data catalog that comes with Amazon Athena
- Your own Apache Hive metastore, for example on an Amazon EMR cluster
An external schema references a database in your external data catalog and contains your external tables. External tables allow you to query data in S3 using the same SELECT syntax as with other Amazon Redshift tables. External tables are read-only, i.e. you can’t write to an external table.
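To make this concrete, here is a minimal sketch of the DDL involved. The schema name, catalog database, IAM role ARN, bucket path, and column layout are hypothetical placeholders, not values from this article:

```sql
-- Register an external schema backed by an external data catalog.
-- The database name and role ARN below are placeholders.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a read-only external table over files in S3.
CREATE EXTERNAL TABLE spectrum_schema.page_views (
    user_id   BIGINT,
    page_url  VARCHAR(2048),
    view_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/clickstream/';
```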
You don’t have to write fresh queries for Spectrum. You can keep writing your usual Redshift queries. The only change with Spectrum is that the queries should contain a reference to data stored in S3.
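For example, a routine aggregation over the hypothetical external table above looks just like any other Redshift query; only the schema name hints that the data lives in S3:

```sql
-- Top pages by views, computed directly over S3 data via Spectrum.
SELECT page_url,
       COUNT(*) AS views
FROM spectrum_schema.page_views
WHERE view_time >= '2017-01-01'
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```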
Joining Internal and External Tables with Amazon Redshift Spectrum
The Redshift query engine treats internal and external tables the same way. You can run the typical operations, such as queries and joins, on either type of table, or on a combination of both; for example, you can query an external table and join its data with that of an internal one.
Let’s say you are using Redshift to analyze data of your e-commerce website’s visitors. You are analyzing metrics such as what pages they visit, how long they stay, and what they buy. You keep a year’s worth of data in your Redshift clusters. You move the older data to S3.
During the course of the analysis, you notice an odd seasonal variation. You want to see if this was also true for past years, or if it is just an aberration for this year. Fortunately, you have saved historic clickstream data in S3, going back several years. You can now access that historic data via an external table with Spectrum, and run the same queries you’re running in Amazon Redshift. You can also create new insights by joining data from previous years with this year’s data.
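As a sketch of what that comparison might look like, assume a hypothetical internal table public.page_views holding the current year, and the external spectrum_schema.page_views table from above holding three full years of history in S3:

```sql
-- Compare this year's monthly views (internal table) against the
-- monthly average of past years (external table over S3).
WITH current_year AS (
    SELECT DATE_PART(month, view_time) AS month,
           COUNT(*) AS views
    FROM public.page_views              -- internal Redshift table
    GROUP BY 1
),
past_years AS (
    SELECT DATE_PART(month, view_time) AS month,
           COUNT(*) / 3.0 AS avg_views  -- assumes three years of history
    FROM spectrum_schema.page_views     -- external table, data in S3
    GROUP BY 1
)
SELECT c.month,
       c.views     AS views_this_year,
       p.avg_views AS avg_views_past_years
FROM current_year c
JOIN past_years p ON c.month = p.month
ORDER BY c.month;
```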
Redshift parses, compiles and distributes an SQL query to the nodes in a cluster, in the usual manner. The part of the query that references an external table is sent to Spectrum. Spectrum processes the relevant data in S3, and sends the result back to Redshift. Redshift collects the partial results from its nodes and from Spectrum, merges and joins them as needed, and returns the complete result.
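If you want to see how much S3 data Spectrum actually scanned for a query, and therefore what it cost, Redshift exposes system views for Spectrum activity. A minimal sketch using SVL_S3QUERY_SUMMARY (consult the AWS documentation for the full set of columns):

```sql
-- Recent Spectrum query steps with rows and bytes scanned in S3.
SELECT query,
       elapsed,
       s3_scanned_rows,
       s3_scanned_bytes
FROM svl_s3query_summary
ORDER BY query DESC
LIMIT 5;
```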
Related reading: What are the best data transformation tools for Amazon Redshift?
Amazon Redshift Spectrum: Why It Makes Sense
If you are an Amazon Redshift customer, here are four primary reasons for bringing Spectrum into your data analytics stack:
- Your business applications remain unchanged and don’t know how or where a query is running. The only change that happens for a business analyst is defining access to external tables.
- External data remains in S3. There is no need for ETL to load the data into your Redshift cluster. This results in the decoupling of your storage layer in S3 from your processing layer in Redshift and Spectrum.
- You don’t need to increase the size of your Redshift cluster to process data in S3. You only pay for the S3 data your queries actually access.
- Redshift does all the hard work of minimizing the number of Spectrum nodes needed to access the S3 data. It also makes processing between Redshift and Spectrum efficient.
Given these benefits, it is important to optimize your S3 storage layer for efficiency and cost savings. Partitioning your data, compressing it, and converting it to columnar formats such as Apache Parquet are all techniques that improve performance. When data is stored in a columnar format such as Parquet, Spectrum reads only the columns relevant to the query rather than scanning every column in the table. Since Spectrum charges you according to the amount of data scanned, storing data as Parquet can reduce query costs substantially.
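As a sketch, a partitioned, Parquet-backed version of the earlier hypothetical table might look like the following; the table name, partition column, and bucket layout are assumptions for illustration:

```sql
-- External table stored as Parquet and partitioned by date.
CREATE EXTERNAL TABLE spectrum_schema.page_views_parquet (
    user_id   BIGINT,
    page_url  VARCHAR(2048),
    view_time TIMESTAMP
)
PARTITIONED BY (view_date DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream-parquet/';

-- Register each partition so Spectrum can prune by date.
ALTER TABLE spectrum_schema.page_views_parquet
ADD PARTITION (view_date = '2017-01-01')
LOCATION 's3://my-bucket/clickstream-parquet/view_date=2017-01-01/';
```

A WHERE clause on view_date then lets Spectrum skip entire partitions, so queries scan, and you pay for, only the slices of data they actually need.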
Spectrum is the perfect tool to take a deep dive into your historical data and unearth more business insights. And it lets you do that at a very low cost per query.