Our previous set of slides has gotten a bit of attention from people interested in big data. One thing we have seen is that many readers are concerned about the time it took to load all of our data into Redshift, specifically the "17 hours for 1.2TB".
Redshift Can Scale
For our testing, we ran a single node XL instance, a multi node XL instance, and an 8XL multi node instance (it is not possible to choose a single node 8XL instance) to compare load times for 1.2TB of data and query speeds on that data. Here is what we saw in our tests.
Loading 1.2TB of data:
- A single node XL instance took 17 hours
- A multi node XL instance of two nodes took 10 hours
- A multi node 8XL instance of two nodes took 2 hours
- Load speeds are almost proportional to the number of nodes
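To put the load times above in more familiar terms, here is a small sketch that converts them to throughput. It assumes 1TB = 1000GB and that the full 1.2TB was loaded in each run; the dictionary keys and the function name are just labels chosen for this example.

```python
# Load times reported above, in hours, for the same 1.2TB data set.
DATA_GB = 1200  # assumption: 1 TB = 1000 GB

load_hours = {
    "1 x XL": 17,
    "2 x XL": 10,
    "2 x 8XL": 2,
}


def throughput_gb_per_hour(hours, data_gb=DATA_GB):
    """Average load throughput in GB per hour for a load that took `hours`."""
    return data_gb / hours


# Average throughput per cluster configuration.
throughput = {name: throughput_gb_per_hour(h) for name, h in load_hours.items()}

for name, rate in throughput.items():
    print(f"{name}: {rate:.1f} GB/hour")
```

These are averages over the whole load, so they say nothing about how throughput varied during a run.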
Running identical queries:
- A single node XL instance took 155 seconds
- A multi node XL instance of two nodes took 55 seconds
- A multi node 8XL instance of two nodes took 31 seconds
- A query runs faster when there are more nodes but the performance does not rise in a linear fashion
These results are interesting because loading speed increases with the number of nodes: loads run on all nodes of a cluster in parallel. Querying is also faster on multiple nodes than on a single node, which shows parallel processing succeeding in this range. In fact, the two node XL cluster took 55 seconds, well under half of the single node's 155 seconds. From this result, we can see that Redshift is probably optimized for multi node clusters.
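The comparison can be made concrete by computing speedups relative to the single node XL run. The sketch below uses only the timings listed above; the helper names `speedup` and `per_node_efficiency` are made up for this example, and per-node efficiency is only computed for the XL cluster because an 8XL node is not the same hardware as an XL node.

```python
def speedup(baseline, measured):
    """How many times faster `measured` is than `baseline` (same units)."""
    return baseline / measured


def per_node_efficiency(baseline, measured, nodes):
    """Speedup divided by node count; 1.0 would be perfectly linear scaling."""
    return speedup(baseline, measured) / nodes


# Baseline: single node XL (17 hours to load, 155 seconds to query).
load_speedup_2xl = speedup(17, 10)     # two XL nodes, loading
query_speedup_2xl = speedup(155, 55)   # two XL nodes, querying
load_speedup_8xl = speedup(17, 2)      # two 8XL nodes, loading
query_speedup_8xl = speedup(155, 31)   # two 8XL nodes, querying

print(f"2 x XL load speedup:   {load_speedup_2xl:.2f}x")
print(f"2 x XL query speedup:  {query_speedup_2xl:.2f}x")
print(f"2 x 8XL load speedup:  {load_speedup_8xl:.2f}x")
print(f"2 x 8XL query speedup: {query_speedup_8xl:.2f}x")

# Efficiency above 1.0 means better-than-linear scaling for that workload.
print(f"2 x XL load efficiency:  {per_node_efficiency(17, 10, 2):.2f}")
print(f"2 x XL query efficiency: {per_node_efficiency(155, 55, 2):.2f}")
```

With these numbers, loading on two XL nodes is a little short of linear, while the query is better than linear. These are single measurements of one query set, so they should be read as indicative rather than as a general rule.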
Additional Thoughts
We realize that 8XL instances cannot be used in a single node cluster. This is an AWS restriction, and it is a point we considered. It means that we need to use a 15 node XL cluster before we can consider launching an 8XL instance. Fortunately, AWS provides a way to upgrade your XL instance to an 8XL instance on the fly with just a few minutes of downtime.
Next Step
These results show how scalable Amazon Redshift is at both data loading and querying. More experiments are needed to determine how it scales with even more data (more than a few dozen TB). Next, we are planning to test various types of queries, including manipulating text columns, as well as different types of compression.