In a digitally powered economy, having access to data can help companies navigate market changes, perform customer analytics and adjust their strategy to meet demand. Unfortunately, most of the data company’s generate are unstructured and siloed across various departments in the organization. According to Forbes, 95 percent of businesses cite the need to manage unstructured data as a problem for their business. Amazon Redshift is a powerful data warehousing solution that stands out among competitors. The platform helps companies bring together their data in meaningful ways to support informed decision-making. Here is a brief tutorial on how to set up Amazon Redshift for your organization.
Amazon Redshift Key Concepts
Amazon Redshift is a petabyte-scale data warehouse service based on PostgreSQL hosted in the AWS cloud. It is a suite of computing resources called nodes and clusters. Each cluster runs one or more databases. The platform supports a variety of client connections to on-premise and cloud systems. These tools support many business use cases including business intelligence (BR), reporting, big data, and analytics. A notable success story for the platform is iHeart Radio. The company leveraged Redshift to streamline its ETL workflow for over 130 data flows. Before we get started, let’s review a few concepts about the platform.
Connections
The platform supports client application connections via a variety of industry-standard connection methods. These methods include JDBC and ODBC drivers for PostgreSQL.
Compute Nodes
A node represents an individual compute resource that has a dedicated CPU, memory, and disk storage. AWS offers concurrency scaling to accommodate increased resource usage.
Leader Node
When there is more than one node, the system appoints one as a leader node. The leader node manages and coordinates with the worker nodes to carry out database operations. Once it receives a query, the leader compiles the code and distributes the code to worker nodes. Each worker node carries out a portion of the work to return results for the query.
Node Slices
Nodes are partitioned into what are known as slices. Each slice is allocated a portion of the node’s memory and disk space to accomplish the work assigned to it. Each slice on the node works in parallel to execute the operation.
Clusters
A cluster in Redshift is a collection of one or more nodes. Each cluster runs the Redshift engine and hosts one or more databases.
Internal Network
Redshift provides high-performance workload management by taking advantage of proximity and high-bandwidth connections over a private network. This reduces latency and allows the leader and worker nodes to communicate directly over an isolated network.
Databases
Each cluster houses one or more databases. The databases are stored on compute nodes.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
Prerequisites for Setting Up Amazon Redshift on AWS
Before getting started, there are a couple of things you will need to do. The first step is signing up for AWS and second configuring your firewall.
Signup for Amazon Web Services
- The first step is creating an AWS account by using this link.
- Follow the on-screen instructions to complete registration.
Configure Firewall
The Redshift data warehouse is best set up with access through port 5439 on your firewall. If your client computer is behind a firewall, be sure this port is configured to allow access. You can allow access through another port if necessary. However, you won’t be able to change it for your cluster after it is created.
Setting Up Amazon Redshift on AWS
AWS Redshift can transfer datasets from a variety of data sources using some of the options noted below. However, for this tutorial, we’ll be showing step-by-step how to create an Amazon Redshift cluster to load data from an AWS S3 bucket.
Getting Familiar With the AWS Console
Upon logging into the Redshift console, you will see a dashboard that contains an overview of your clusters. It shows the total number of nodes, number of automated snapshots, clusters with their status. The navigation pane contains options for performing the necessary operations on your cluster. The navigation pane consists of:
-
Dashboard - View the dashboard containing metrics for the clusters
-
Clusters - Create clusters, view list of clusters, and perform cluster-related actions
-
Queries - A list of recent queries and the SQL text for each. This option is useful for troubleshooting problem queries
-
Editor - Run database queries
-
Datashares - Authorize accounts to access datashares
-
Config - Configure connections to Redshift from SQL client tools. Here is where you configure JDBC and ODBC options.
-
Marketplace - Provides information and other tools or Amazon Web Services to complement your Redshift clusters
-
Advisor - Provides recommendations for configuration changes to help prioritize your optimization efforts
-
Alarms - Create alarms on clusters to notify you of performance issues or to track metrics over a specified period
-
Events - Track events and get reports on events such as date event occurred, description, or event source
-
What's new - Shows information about new Amazon Redshift features and product updates
Create a Cluster
The first step in getting set up is creating a cluster to store your databases.
-
Log in to the management console
- Navigate to the Amazon Redshift console
- On the upper right of the screen, select a region to create the cluster
- Select the Clusters option from the left-hand navigation menu
- Select Create Cluster
- On the Cluster configuration page enter values for Cluster identifier, Node type, Nodes, and how you plan to use the cluster (Free Trial or Production).
- In the database configuration section, specify the Admin user name and the Admin user password. The Database name and Database port are optional at this stage.
Create an IAM Role
An IAM role is required for any AWS service that needs to connect to your clusters. This role needs the appropriate permissions to access the resource. The new role that you create will be attached to your cluster.
- Navigate to the IAM console
- In the navigation pane, choose Role
- From there choose Create Role
- In the AWS Service group, select Redshift
- Select Redshift-Customizable for Use Case
- Select Next
- Select Permissions
- For Attach permission use case, select Customizable, then click Next: Permissions
- Choose AmazonS3ReadOnlyAccess
- Click Next: Tags
- You can choose to enter tags that help add further details to the role
- Click Next: Review
- In the Role name field, enter a name for your role
- Click Create Role
Grant Access to Query Editors
There are two methods for querying your Redshift database: one of the AWS-provided query editors or a SQL client tool. Using the query editor is the easiest way to get started with executing commands on the database. This tutorial will focus on using a query editor. The next step is attaching the appropriate policies to the IAM user that is used to access your cluster.
- Navigate to the IAM console
- Select Users
- Select the user that needs access to the query editor and choose Add permissions
- In the Policy Name field select: AmazonRedshiftQueryEditor and AmazonRedshiftReadOnlyAccess
- Click next to review your configuration
- Select Add Permissions
Create a Database
To create a database perform the following steps:
- Open the query editor
- Create your database using standard SQL (example: CREATE DATABASE SALESDB;)
- From here, you can use the COPY command to load sample data or load data from Amazon S3.
How Integrate.io Can Help
If you are looking to use Redshift as a part of your ETL inbound data process, Integrate.io can help. As a low-code development platform with a suite of over 100 prebuilt API integrations, companies can create pipelines that access data from a variety of sources. The tool features a user-friendly drag-n-drop interface that doesn’t require technical knowledge to use. It empowers business users to use the tool to get the information they need without relying on IT resources. To get started, reach out to us today to schedule a 7-day demo or pilot.