Should you pick Amazon Redshift or Hadoop/Hadoop Hive for your data warehouse?
In this guide, we examine the history and capabilities of Redshift and Hadoop, and how they compare across price, performance, and ease of use. Based on our research, our personal experiences, and the experiences of our clients, we recommend Redshift for almost all cases, and explain exactly why below. We know that picking the right solution for your data needs is both important and challenging, so if you have any questions, let us know in the comments, and we’ll get back to you ASAP.
A Brief History
When Hadoop was first released in 2006 under the Apache Software Foundation, it was a game-changer in the world of big data. By using a distributed file system to spread datasets across multiple servers, it made analyzing large datasets easier and faster than ever before. Facebook then set out to make Hadoop more accessible to its team of data analysts by building Hadoop Hive, a SQL-like querying interface for Hadoop, which it open-sourced in 2008 and which became a top-level Apache project in 2010. Then, in 2012, Amazon Web Services released Redshift, which dramatically decreased the time and cost of storing and analyzing terabyte-sized structured datasets. In 2016, AWS launched Athena, which allowed SQL-like querying of files in S3, and in April 2017, AWS announced a similar addition to the Redshift platform: Redshift Spectrum, which lets customers query S3 files directly through Redshift and combine the results with the data already in Redshift.
What is Hadoop?
Hadoop was one of the first attempts at solving a problem that has become increasingly common: How do you store and analyze a dataset that is too large to fit on one server? Hadoop solves this problem with two components: a distributed file system and MapReduce.
On a distributed file system, data is split at the disk level across a cluster of servers. Because the file system stores raw files rather than database tables, this presents an interesting advantage over traditional data warehouses: Hadoop can store both structured and unstructured data.
Because Hadoop is open-source software that can be deployed on commodity servers, its scalability is almost limitless. However, since Hadoop is built on Java and requires a complex setup across multiple servers, getting started with and maintaining a Hadoop cluster can be an expensive and time-consuming proposition. While cloud services such as AWS EMR and Cloudera have made spinning up Hadoop clusters much easier, maintaining the cluster, importing data, and querying that data still require specialized knowledge.
To analyze the data, a MapReduce job runs on each server, scanning the data assigned to it and “mapping” it into simple key-value pairs, e.g. a list of states and the number of cities in each state. It then “reduces” that data into some analytical result, e.g. the average number of cities per state. The results from each server are then combined to determine the final value.
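To make the two phases concrete, here is that same state/city computation expressed in SQL (a minimal sketch, assuming a hypothetical `cities` table with one row per city): the inner query plays the role of the “map” step, emitting a (state, city_count) pair for each state, and the outer query plays the role of the “reduce” step, combining those pairs into a single average.

```sql
-- "Map" step: each server emits (state, city_count) pairs for its slice of the data
-- "Reduce" step: the partial results are combined into one analytical value
SELECT AVG(CAST(city_count AS FLOAT)) AS avg_cities_per_state  -- cast so the average isn't truncated to an integer
FROM (
    SELECT state, COUNT(*) AS city_count  -- key = state, value = number of cities
    FROM cities                           -- hypothetical table with one row per city
    GROUP BY state
) AS mapped;
```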
This process of having each server analyze a small portion of the data and then combine the results is much faster than having a single server go over each record individually, and it is roughly the same underlying process that Redshift uses in its Massively Parallel Processing architecture.
What is Hadoop Hive?
Hadoop Hive can roughly be understood as an attempt to make Hadoop’s distributed file system and MapReduce structure behave more like a traditional data warehouse by letting data analysts run SQL-like queries on top of Hadoop. The queries, written in HiveQL, are translated into MapReduce jobs and executed against data stored in the Hadoop Distributed File System (HDFS). In theory, rather than learning Java, you would only need to make a few edits to your SQL queries to translate them into HiveQL.
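For example, a query like the following (against hypothetical `users` and `orders` tables) reads like ordinary SQL, but Hive compiles it into one or more MapReduce jobs that run across the cluster:

```sql
-- HiveQL: looks like standard SQL, runs as MapReduce under the hood
SELECT u.country, COUNT(*) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id     -- each join typically becomes its own MapReduce stage
WHERE o.created_at >= '2017-01-01'
GROUP BY u.country
ORDER BY order_count DESC;
```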
While there is no doubt that Hadoop Hive makes Hadoop more accessible to the average data analyst, it raises the question:
If your system is so complex that you need an entire framework to translate your queries into a different language and architecture, wouldn’t a simpler system serve you better?
We’ll return to this idea in a moment, but first, let’s discuss Redshift.
What is Redshift?
AWS Redshift is a cloud data warehouse that combines a Massively Parallel Processing (MPP) architecture (similar in spirit to Hadoop’s distributed approach; we recommend reading our guide) with columnar storage, making analytical queries very fast. Moreover, it is SQL-based, which makes it easy for data analysts to adopt. Unlike Hadoop’s distributed file system, Redshift proper is a database, so it only supports structured data. However, AWS also offers Redshift Spectrum, which makes it easy to query files stored in S3 from within Redshift.
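Here is a minimal sketch of what that looks like in practice (the schema name, IAM role ARN, bucket path, and table layout are all placeholders): you register an external schema and table that point at files in S3, then query them with ordinary SQL, even joining them against tables that live in Redshift.

```sql
-- Register an external schema backed by the AWS data catalog
-- (role ARN and database name are placeholders)
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Describe the files sitting in S3 as a table (hypothetical layout)
CREATE EXTERNAL TABLE spectrum.events (
    event_id  BIGINT,
    user_id   BIGINT,
    event_ts  TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Query the S3 files directly and join them with a local Redshift table
SELECT u.plan, COUNT(*) AS event_count
FROM spectrum.events e
JOIN users u ON u.id = e.user_id
GROUP BY u.plan;
```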
It’s safe to say that the development of Redshift Spectrum was an attempt by Amazon to own the Hadoop market. As we’ll see below, Redshift tends to be faster, cheaper, and easier to maintain than Hadoop, so the one remaining reason to pick Hadoop was its ability to query unstructured data. With Redshift Spectrum, that reason is nullified.
Redshift vs. Hadoop: Which one wins?
Here at Integrate.io, we’ve helped dozens of companies solve their big data challenges. Based on our own experience, our clients’ experiences, and our research, we’ve determined that in almost all cases, Redshift is the way to go. Let’s break it down piece by piece.
Performance
Benchmarks, including Airbnb’s tests and our own detailed below, have shown that Redshift can be 5x to 20x faster than Hadoop Hive on the same dataset.
Because Redshift is a columnar database, a query reads only the columns it needs, which makes scans over structured data far faster than scans over raw, unstructured files. Moreover, since Redshift uses a Massively Parallel Processing architecture, the leader node distributes data and query work among the compute nodes to optimize performance, as the sketch below shows.
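That distribution is something you can steer yourself when you create a table (the table and column names here are hypothetical): the distribution key controls how the leader node spreads rows across compute nodes, and the sort key keeps data ordered on disk so range-restricted scans stay fast.

```sql
-- Hypothetical table: DISTKEY co-locates each user's rows on one node,
-- SORTKEY keeps rows ordered by time for fast range scans
CREATE TABLE page_views (
    user_id    BIGINT,
    page_url   VARCHAR(2048),
    viewed_at  TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (viewed_at);
```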
A well-known test conducted by Airbnb showed that Redshift completed a simple range query against a table with three billion rows five times faster than Hadoop Hive. In another Airbnb test, a complex query with two joins running against a million rows took Hadoop Hive 182 seconds to complete, while Redshift took only 8 seconds.
In our own test, we ran a full-table-scan query that joined 4 tables for a combined 400GB of data. It finished in one minute on Redshift, but took a full 10 minutes to complete on Hadoop Hive. As we scaled the data up to 1.2TB, we saw similar results, with the query taking 2 minutes on Redshift and over 20 minutes on Hadoop Hive.
Cost
We wrote a full blog post on Redshift’s pricing, which we recommend you check out. While comparing two completely different systems across all use cases is tricky, in most cases Redshift will be the cheaper option. In their own test, Airbnb’s data engineering team concluded that their setup would cost approximately $13.60 per hour on Redshift but over $57 per hour on Hadoop. Our own Redshift pricing analysis shows that if you follow the right steps, you can test and run Redshift relatively cheaply.
Ease of Use
Creating a Redshift cluster is super easy and can be done in minutes. As a cloud service, your Redshift database is fully managed by AWS, fault tolerant, and comes with automated backups and fast restores.
On the other hand, even with a hosted cloud service like AWS EMR, managing a Hadoop cluster can be its own full-time job. While Hadoop can support automated backups, fast restores, and the rest of what Redshift offers, none of it comes by default. The specialized expertise a Hadoop implementation requires is bound to make any data warehousing effort an expensive and time-consuming exercise in planning, development, deployment, and administration.
Scale
You’ll have to examine your own use case to determine how important massive scale is for your needs. While Redshift proper has an upper limit of 100 nodes and a maximum of 16TB of storage per node, Redshift Spectrum lets you store an almost unlimited amount of data cheaply in S3 and query it only when you need to. Meanwhile, there are virtually no limits to scaling Hadoop. On scalability, we would call the two systems roughly tied.
Data Transfer
To be honest, neither is a great option here. Hadoop’s complex setup makes getting data into its file system non-trivial, while keeping data in Redshift up-to-date and accurate in near real-time is no easy task. We’ve seen dozens of companies get stuck at the replication and sync stage of setting up their data warehouse. That’s where Integrate.io comes in. Integrate.io provides continuous, near real-time replication into Redshift from your transactional databases, such as MySQL, PostgreSQL, Amazon Aurora, and more. With an easy, one-time setup, our robust system ensures 100% accuracy with each load, so your data is always up to date.
Conclusion
While Hadoop was one of the first major players in the space and is still buoyed by its open-source nature, Redshift is clearly becoming the preferred solution for almost all use cases. We hope this guide has helped you determine which solution is right for you. Once again, if you have any questions, we would love to help you out in the comments!