Big change is coming
Over the past few years, we have been providing a web service that makes it easy to process and analyze big data in the cloud with Hadoop, built on Amazon Elastic MapReduce. This experience has taught us the benefits and caveats of using Hadoop and Hive. When Amazon Redshift was announced last November, we were intrigued by Amazon's claim of simplifying big data, and fortunately we were able to get preview access. After some initial testing, we were surprised by Redshift's performance on a fairly typical amount of data (a few terabytes), and we immediately saw how it addressed some of the caveats we had run into with Hadoop and Hive. In our experience, most people were using Hadoop and Hive for simple queries over large amounts of data. In fact, we were so impressed by Redshift that we started recommending that potential clients look into it instead.
The benchmark
After our initial tests, we wanted to run a benchmark to put some numbers down and compare Redshift to what we were already using. We wanted a real-world scenario, so we tailored the tests to one field: advertising networks. Ad networks need to show customers reports covering total impressions, total clicks, ad spend, CPC, CPM, CTR, and so on, frequently and as quickly as possible. Based on what I have seen personally in the field as a product/tech manager of a location-based ad network, this was an ideal scenario to benchmark. Ad networks also collect a huge amount of data (especially impression data), so a typical application has to run on a distributed system; in practice that almost always means Hadoop, even when the data is only a few terabytes.

The most difficult and time-consuming part of this benchmark was importing the data into a Redshift cluster. Our first attempt at importing 300GB of data took about 5 hours. We feel this is worth reporting because the transfer speed could be a limiting factor for some uses. We then uploaded 1.2TB of data onto Redshift, as 2,880 files of 300MB each; this upload took 17 hours.

Once the data was loaded, we devised a query that could be run equivalently on both Hadoop and Redshift: a full-table-scan query joining four tables. On the first data set (300GB), the query finished in 1 minute on Redshift, which surprised us, since the equivalent query took 10 minutes on Hadoop. At this point we also saw a dramatic cost reduction, since Redshift needed far fewer servers, along with a drastic reduction in maintenance effort. We then ran the same query on 1.2TB of data. The results scaled well: just under two minutes on Redshift versus over 20 minutes for the equivalent Hive query on Hadoop.
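For readers who want to try something similar, here is a minimal sketch of that load-and-query workflow, using Python with psycopg2 against a Redshift cluster. The cluster endpoint, S3 path, credentials, table names, and column names are all hypothetical placeholders; the actual schema and queries in our benchmark differ. The point is only to show the shape of a bulk COPY from S3 followed by a full-scan report query joining four tables.

```python
# Hypothetical sketch: bulk-load ad network data into Redshift and run a
# full-scan, four-table join report. All identifiers below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="adnetwork",
    user="admin",
    password="<password>",
)
cur = conn.cursor()

# Bulk-load gzipped, pipe-delimited files from S3 with Redshift's COPY command.
cur.execute("""
    COPY impressions
    FROM 's3://example-bucket/impressions/'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    DELIMITER '|' GZIP;
""")
conn.commit()

# A full-scan report joining four tables (impressions, clicks, ads, campaigns)
# to compute impressions, clicks, spend, CTR, and CPC per campaign.
cur.execute("""
    SELECT c.campaign_name,
           COUNT(i.impression_id)                                   AS impressions,
           COUNT(k.click_id)                                        AS clicks,
           SUM(k.cost)                                              AS spend,
           COUNT(k.click_id)::float / NULLIF(COUNT(i.impression_id), 0) AS ctr,
           SUM(k.cost) / NULLIF(COUNT(k.click_id), 0)               AS cpc
    FROM impressions i
    JOIN ads a         ON i.ad_id = a.ad_id
    JOIN campaigns c   ON a.campaign_id = c.campaign_id
    LEFT JOIN clicks k ON k.impression_id = i.impression_id
    GROUP BY c.campaign_name;
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```

The same report can be expressed almost verbatim in HiveQL, which is what makes this kind of query a reasonable apples-to-apples workload for the comparison.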
Why test Redshift?
To be fair, we understand that this benchmark is not a rigorous technical comparison. However, we ran it based on what we saw as the actual usage of Hadoop: in our experience, most users simply default to Hive on Elastic MapReduce to run simple queries over large amounts of data, often and nothing more. Of course, there are users who leverage RCFile or HBase and still benefit from Hadoop, but they are the progressive users, making up a small percentage of the total. Before Redshift, Hadoop + Hive was the way to go for processing terabytes of data, because there was no other way to process that much data quickly and at a reasonable price (thousands of dollars per month instead of millions). Columnar databases such as Vertica and Netezza are certainly good candidates, but unfortunately very expensive. Workloads built around frequent queries, such as incremental reporting and optimization, were never a good fit for Hadoop, and those who used Hadoop for them ended up paying a relatively high price. The arrival of Redshift brings a truly disruptive alternative because of its value: it is a well-optimized columnar data warehouse that bundles compute servers, networking, storage, and even backup to S3. Amazon has made big claims, saying its price is one tenth that of traditional data warehouses. From what we've seen, users who only need to run a few queries per hour will save at least that much (if not a lot more) by moving those jobs from Hadoop to Redshift. We want to help people analyze their big data as efficiently as possible. There have been many cases of unnecessary Hadoop usage, and we want to introduce options such as Redshift for those cases.