Let's get started with this great debate. First, a step back; we’ve pointed out that Apache Spark and Hadoop MapReduce are two different Big Data beasts. The former is a high-performance in-memory data-processing framework, and the latter is a mature batch-processing platform for the petabyte scale. We also know that Apache Hive and HBase are two very different tools with similar functions. Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on Hadoop.
Now it’s the turn of Apache Spark and Apache Tez. On paper, the two have a lot in common. Both have in-memory capabilities, both can run on top of Hadoop YARN, and both are designed to work with a wide range of data types and data sources. So what’s the difference between the two?
- Tez fits nicely into YARN architecture. Spark may run into resource management issues.
- Spark is more for mainstream developers, while Tez is a framework for purpose-built tools.
- Spark can't run concurrently with YARN applications (yet). Tez is purposefully built to execute on top of YARN.
- Tez's containers can shut down when finished to save resources. Spark's containers hog resources even when not processing data.
These are just a few of the differences at a high level. Here we’ll explore each of these items.
What is Apache Spark?
Apache Spark is an open-source analytics engine and cluster computing framework for processing big data. It began as a research project at UC Berkeley's AMPLab and is now maintained under the non-profit Apache Software Foundation, a decentralized organization that works on a variety of open-source software projects.
Spark became a top-level Apache project in 2014. It was designed as an alternative to the Hadoop MapReduce distributed computing framework, and it preserves many of the benefits of MapReduce, like scalability and fault tolerance, while also improving speed and ease of use.
Besides its core data processing engine, it includes libraries for SQL, machine learning, and stream processing. The framework is compatible with the Java, Scala, Python, and R programming languages, winning broad appeal among developers. It also supports third-party technologies like Amazon S3, Hadoop's HDFS, MapR XD, and NoSQL databases such as Cassandra and MongoDB.
Its appeal comes from its capacity to unite different processes, technologies, and techniques into a single big data pipeline, enhancing productivity and efficiency. Thanks to its flexibility, it has become a highly popular and effective "Swiss army knife" for the world of big data processing.
What is Apache Tez?
Apache Tez is an open-source framework for big data processing that builds on MapReduce technology. Like Spark, it offers an execution engine that can use directed acyclic graphs (DAGs) to process enormous quantities of data.
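It generalizes the MapReduce paradigm by treating computations as DAGs. Work that would otherwise be split across several MapReduce jobs combines into a single job, with each processing step treated as a node in the DAG. The dependencies in the graph determine which steps can run concurrently and which must run one after another.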
Meanwhile, the edges of the DAG represent the movement of data between jobs. Tez is data type-agnostic, so it's concerned only with the movement of data (and not the format it takes).
By addressing some of MapReduce's limitations, Tez seeks to improve the performance of data processing jobs. This added efficiency gives programmers more room to make the design and development choices that they believe are best for their projects.
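To make the vertex-and-edge picture concrete, here is a minimal sketch in plain Python. It is not the Tez API: the vertex names (`read_lines`, `word_count`, `char_count`, `summarize`), the sample input, and the `run_dag` helper are all hypothetical, and everything runs in a single process. It only illustrates the idea of running each processing step once its upstream steps have finished and passing their outputs along the edges.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical vertices: each one is a processing step that receives a dict
# of its upstream vertices' outputs, keyed by vertex name.
def read_lines(_inputs):
    return ["spark and tez", "tez runs on yarn"]

def word_count(inputs):
    return sum(len(line.split()) for line in inputs["read_lines"])

def char_count(inputs):
    return sum(len(line) for line in inputs["read_lines"])

def summarize(inputs):
    return {"words": inputs["word_count"], "chars": inputs["char_count"]}

VERTICES = {
    "read_lines": read_lines,
    "word_count": word_count,
    "char_count": char_count,
    "summarize": summarize,
}

# Edges, written as: vertex -> set of upstream vertices it consumes data from.
EDGES = {
    "read_lines": set(),
    "word_count": {"read_lines"},
    "char_count": {"read_lines"},
    "summarize": {"word_count", "char_count"},
}

def run_dag(vertices, edges):
    """Run every vertex after all of its upstream vertices have finished."""
    outputs = {}
    for name in TopologicalSorter(edges).static_order():
        upstream = {dep: outputs[dep] for dep in edges[name]}
        outputs[name] = vertices[name](upstream)
    return outputs

result = run_dag(VERTICES, EDGES)
print(result["summarize"])
```

Because `word_count` and `char_count` depend only on `read_lines`, an engine that sees the whole graph is free to run them in parallel, which is the kind of freedom a fixed map-then-reduce pipeline does not express directly.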
Apache Spark brands itself as "a unified analytics engine for large-scale data processing." Meanwhile, Apache Tez calls itself "an application framework which allows for a complex directed acyclic graph of tasks for processing data."
Because Spark also uses directed acyclic graphs, don’t the two tools sound similar? Maybe. But there are also important points of distinction to consider. Here are the fundamental differences between the two:
- Difference #1: Hive and Pig
- Difference #2: Hadoop YARN
- Difference #3: Performance tests
We'll go into more detail about each of these differences in the sections below.
Do They Support Pig and Hive?
Hive and Pig are two open-source Apache software applications for big data. Hive is a data warehouse, while Pig is a platform for creating data processing jobs that run on Hadoop. While both Spark and Tez claim to support Pig and Hive, the reality isn't so clear. We tried running Pig on Spark using the Spork project, but we ran into some issues; Pig on Spark, at least, is still iffy at best.
Using YARN
YARN is Hadoop's resource manager and job scheduler. In theory, Spark can execute either as a standalone application or on top of YARN, whereas Tez has been purpose-built to execute on top of YARN. Spark, though, can't run concurrently with other YARN applications (at least not yet).
Gopal V, one of the developers for the Tez project, wrote an extensive post about why he likes Tez. He concludes that:
“Between the frameworks, I've played with, that is the real differentiating feature of Tez - Tez does not require containers to be kept running to do anything, just the Application Manager running in the idle periods between different queries. You can hold onto containers, but it is an optimization, not a requirement during idle periods for the session.”
By “frameworks” he also means Spark—its containers need to keep running and hog resources even when they aren’t processing any data. Tez containers, however, can shut down as soon as they are finished and release the resources.
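The resource argument can be illustrated with some back-of-the-envelope arithmetic. The sketch below is a toy model with made-up numbers, not a measurement of Spark or Tez: the `container_seconds` function, the query durations, the idle gaps, and the container count are all assumptions chosen for illustration. It compares a session that keeps its worker containers during idle periods with one that keeps only a single application master container between queries.

```python
def container_seconds(queries, idle_gaps, containers, hold_when_idle):
    """Total container-seconds reserved over one session (toy model).

    queries:        durations of the queries, in seconds
    idle_gaps:      durations of the idle periods between queries, in seconds
    containers:     worker containers used while a query is running
    hold_when_idle: True keeps the workers during idle periods; False keeps
                    only the one application master container.
    """
    # While a query runs, the workers plus the application master are held.
    busy = sum(queries) * (containers + 1)
    # While idle, either everything is held or just the application master.
    idle_width = containers + 1 if hold_when_idle else 1
    idle = sum(idle_gaps) * idle_width
    return busy + idle

# Hypothetical session: three short queries separated by long idle periods.
queries = [30, 45, 20]
idle_gaps = [300, 600]

held = container_seconds(queries, idle_gaps, containers=10, hold_when_idle=True)
released = container_seconds(queries, idle_gaps, containers=10, hold_when_idle=False)

print(f"workers held while idle:     {held} container-seconds")
print(f"workers released while idle: {released} container-seconds")
```

In this made-up session most of the reserved capacity in the first case is spent on idle time, which is the cost the quote is pointing at. Real clusters will differ depending on configuration and workload.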
Chances are that you already use Hadoop-based applications such as Hive, HBase, or even classic MapReduce. You can install Spark on any Hadoop cluster, but you may run into resource management issues. Tez, on the other hand, could fit quite nicely into your YARN architecture, resource management included.
Where Apache Spark Shines: Graph Processing
GraphX is a graph computing engine that extends the Spark RDD. Here, “graph” means a graph in the graph-theory sense, a set of vertices connected by edges, as opposed to the charts used for business calculations. Graphs of this kind capture interaction and dependency relationships between data.
GraphX began as a research project at UC Berkeley. The project was later donated to the Apache Software Foundation and the Spark project.
GraphX differs from other graph computing engines because it unifies graph analysis and ETL in a single platform. GraphX can also analyze data that is not in graph form. Its in-memory computation ability makes GraphX faster than other graph processing engines.
Common Use Cases for Graph Processing
- Social Network Analysis - Marketers use graph analysis to identify influencers to target.
- Fraud Detection - Banks, credit card companies, and online stores use graph analysis to identify unusual trends.
- Supply Chain Optimization - Companies use graph analysis to determine optimal routes for their supply chain.
- Loan Decisioning - Mortgage companies and banks use graph analysis to evaluate an applicant’s data to make a loan decision.
How Google Uses Graph Processing
Google uses a graph analysis algorithm known as PageRank. PageRank ranks the vertices in a graph by importance, where importance depends on the number of edges pointing to a vertex and on the importance of the vertices those edges come from. The algorithm was developed by the founders of Google, which makes the popular search engine a prime example of graph processing: web pages are the vertices, hyperlinks are the edges, and a page ranks higher when many important pages link to it.
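The iterative idea behind PageRank fits in a few lines. What follows is a minimal sketch in plain Python, not GraphX code and not Google's production algorithm: the four-page link graph, the `pagerank` function name, the damping factor of 0.85, and the fixed iteration count are all illustrative assumptions. In this commonly described formulation, each page repeatedly shares its current rank among the pages it links to, so a link from a highly ranked page counts for more than a link from an obscure one.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical web: a, b, and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph, page `c` comes out on top because three pages link to it. Page `a` ends up well ahead of `b` and `d` even though it has only one incoming link, because that link comes from the highest-ranked page.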
So Which One is Faster?
Perhaps the biggest question of them all—which is faster? According to various benchmarks, both options dramatically improve MapReduce performance; however, the winner may depend on who's doing the measuring. The jury's still out in terms of an independent third-party assessment.
Spark claims to run up to 100 times faster than MapReduce. Benchmarks performed at UC Berkeley’s AMPLab show that it runs much faster than its counterpart (the tests use Shark, the SQL engine on Spark that preceded Spark SQL).
Because Spark was invented at Berkeley, however, these tests might not be completely unbiased. Also, the benchmarks were run several years ago against Hive 0.12, which runs over MapReduce. Beginning with version 0.13, Hive can use Tez as its execution engine, which results in significant performance improvements.
Meanwhile, Hortonworks ran its own benchmarks on the performance question. They found that Hive 0.13 running over Tez works up to 100 times faster than Hive 0.12 (though quite a few test queries mysteriously disappeared). 100 times faster... hmm, sound familiar?
So they both have up to 100 times better performance than Hadoop MapReduce. But which is the fastest?
No one can say--or rather, they won't admit. If you ask someone who works for IBM they’ll tell you that the answer is neither and that IBM Big SQL is faster than both. We need a third party to run independent performance tests and settle the score, once and for all.
The Bottom Line
The question may ultimately come down to politics and popularity. It is a clash of the Big Data titans, with Cloudera rooting for Spark and Hortonworks for Tez. Spark is more widespread since it’s available in various distributions, while Tez is only available in Hortonworks’ distro.
In the end, the user bases may decide the frameworks’ fate. At the moment, Spark is winning the race by far, at least according to Google Trends.
Maybe once the hype has faded and people have gained more experience working with both, we’ll finally be able to tell which one will become the heir to the MapReduce crown.