Intrigued about the best SQL-on-Hadoop Tools? In our post "8 SQL-on-Hadoop Challenges," we gave an overview of several tools that help to bridge the gap between the two technologies—but without going into too many details. This time we’ll dive in and learn about 12 tools that bring SQL to Hadoop, covering both open-source and commercial solutions.
We've selected these tools for a combination of reasons including price, features, helpful use cases, accessibility, future potential, and more. Without further ado, here's our list of 12 essential SQL-on-Hadoop tools.
Open-Source SQL-on-Hadoop Tools
1. Apache Hive
Apache Hive is one of the top SQL-on-Hadoop tools. Initially developed by Facebook, Hive is a data warehouse infrastructure built on top of Hadoop. It allows querying data stored on HDFS for analysis via HQL, an SQL-like language translated to MapReduce jobs.
While it seems to provide SQL functionality, Hive performs batch processing on Hadoop and does not provide interactive querying. It stores metadata in a relational database and requires maintaining a schema for the data. Hive supports several file formats, as well as processing compressed data on Hadoop along with user-defined functions.
The bottom line: Hive offers batch processing on Hadoop with an SQL-like language.
Deciding on Hive vs. HBase? Check out our article!
2. Apache Sqoop
One of the most valuable SQL-on-Hadoop Tools, Apache Sqoop allows importing and exporting data from relational databases to Hadoop via JDBC, the standard API for connecting to databases with Java. It can also work without JDBC, as long as the relevant tools allow bulk import/export of data.
Sqoop works by running a query on the relational database and then exporting the resulting rows into files in one of the following formats: text, binary, Avro, or Sequence Files. These files are saved on Hadoop’s HDFS, and can also go from Hadoop back into a relational database. Finally, Sqoop integrates with HCatalog, a table and storage management service for Hadoop that allows querying Sqoop’s imported files via Hive or Pig. (See our Sqoop blog post for more info.)
The bottom line: Sqoop lets you import and export data from SQL databases to and from Apache Hadoop.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
3. Apache Phoenix
Apache Phoenix is an SQL skin for interactive queries over HBase. It compiles SQL queries into a series of HBase scans and produces JDBC result sets.
Note that Phoenix requires maintaining a schema that can either be built from scratch or mapped from an existing HBase table. Furthermore, there are several features Phoenix doesn’t support: full transaction support, derived tables, relational operators, and miscellaneous built-in functions (although they can be added manually). The project is mainly maintained by Salesforce, Intel, and Hortonworks.
The bottom line: Apache Phoenix offers interactive SQL queries over HBase.
4. Apache Impala
Apache Impala is a query engine that runs on top of Hadoop and executes interactive SQL queries on HDFS and HBase. Unlike Apache Hive, which uses batch processing, Impala runs the queries in real-time, thus allowing you to integrate SQL-based business intelligence tools with Hadoop.
Originally developed by the enterprise data cloud company Cloudera, Impala has since become part of the Apache open-source software ecosystem. Impala supports all standard Hadoop file formats, including text, LZO, Sequence Files, Avro, and RCFile. Impala can also run in the cloud via Amazon Elastic MapReduce (EMR).
The bottom line: Impala is an open-source solution for interactive SQL queries over HDFS and HBase.
5. Apache Spark SQL
Apache Spark SQL is a module for Apache Spark, an open-source big data processing engine, that helps you query structured data within Spark programs. It is the successor to Apache Shark, a large-scale data warehouse system for Apache Spark that has been defunct for several years.
Spark SQL introduces a concept known as DataFrames, which are equivalent to relational database tables with additional enhancements and optimizations. The Spark SQL module is compatible with many different data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC, and programming languages including Java, Scala, Python and R.
The bottom line: Apache Spark SQL is an excellent choice if you need SQL-on-Hadoop capabilities within Apache Spark.
6. Apache Drill
Apache Drill calls itself a "schema-free SQL query engine for Hadoop, NoSQL and cloud storage." Drill makes use of a JSON data model that allows for SQL queries over complex and nested data and non-relational tables. You can use standard SQL queries with Drill, just like you would use to query a relational database.
The open-source Drill project is based on Google's Dremel system for querying large, distributed datasets. Drill has a REST API and integrates with business intelligence applications such as Tableau, Qlik, SAS, and Microsoft Excel.
The bottom line: Apache Drill lets you use SQL to query Hadoop and NoSQL databases.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
7. Presto
Presto is an interactive SQL query engine that runs on top of Hive, HBase, and even relational databases and proprietary data stores, helping you combine data from multiple sources across the organization. According to the project website, Presto is "the fastest SQL on Hadoop engine," with the benchmarks to back it up.
Facebook is the main developer behind Presto, and the company uses it to query internal data stores, including a 300-petabyte data warehouse. Other big enterprises such as Airbnb and Dropbox also use Presto in their tech stacks.
The bottom line: Presto is Facebook’s enterprise-class solution for interactive SQL queries over Hive and HBase.
8. Citusdata
Citus Data (not to be confused with CitrusDB) is another interactive querying engine with SQL-like functionality that works over Hadoop. Citus is based on Dremel, Google’s version of a real-time analytics database for processing big data. Since its acquisition by Microsoft in 2019, Citus has been available both as open-source software and as the Hyperscale (Citus) deployment option in Azure Database for PostgreSQL.
Unlike Impala and Presto, Citus uses PostgreSQL as the SQL engine that works behind the scenes. Citus can run on-premises or in the cloud, and supports features such as full-text search, geo search, ODBC/JDBC compatibility. Note that as an analytical database, Citus only supports loading the data in batches.
The bottom line: Citus offers SQL-on-Hadoop interactive querying with PostgreSQL.
Commercial SQL-on-Hadoop Tools
9. Jethro
Jethro claims to offer "the fastest SQL-on-Hadoop engine for BI" by providing an SQL engine for Hadoop that automatically indexes the data as soon as it gets written to Hadoop. According to Jethro's website, the tool can deliver "up to 100 times faster queries" than tools such as Hive and Impala. Installation and use is simple: Jethro can be added to an existing Hadoop cluster, is non-intrusive, and isn't installed on any of the Hadoop storage nodes.
The bottom line: Jethro offers fast, non-intrusive SQL-on-Hadoop via auto-indexing.
10. HAWQ
HAWQ is a commercial SQL-on-Hadoop platform by Pivotal, a subsidiary of EMC. It provides a parallel SQL query engine using Pivotal’s Greenplum Analytic Database and Hadoop’s HDFS for data storage. The HAWQ engine is useful for analytics with full transaction support and supports creating external tables on HDFS that read text, Hive, HBase, and Parquet.
The bottom line: HAWQ is Pivotal’s SQL-on-Hadoop solution.
11. BigSQL
Big SQL by IBM (not to be confused with Postgres by BigSQL, an open-source project which seems to be defunct) is IBM's SQL-on-Hadoop engine for massively parallel processing (MPP) and advanced data queries. You can use IBM Big SQL to access a wide variety of formats, including relational databases, NoSQL databases, HDFS, and object stores.
IBM Big SQL boosts query performance by distributing queries across multiple nodes in a cluster. Use cases for Big SQL include offloading data from existing Oracle or IBM data warehouses, as well as federated access to relational databases containing data that can't be migrated into Hadoop.
The bottom line: IBM Big SQL offers a robust SQL on Hadoop tool for large enterprises.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
12. PolyBase
PolyBase is Microsoft's SQL-on-Hadoop tool for processing SQL queries. Microsoft SQL Server customers can use PolyBase to access data in Hadoop and Azure Blob Storage, as well as SQL Server, Oracle, Teradata, and MongoDB databases.
By using the Transact-SQL language, PolyBase users can query data stored in Hadoop, and also import data from and export data to Hadoop. PolyBase's query optimizer can decide to push computation to Hadoop, creating MapReduce jobs that take advantage of Hadoop's distributed nature.
The bottom line: PolyBase is a powerful SQL-on-Hadoop tool for users of Microsoft SQL Server.
Looking for SQL on Hadoop Guidance? Here's How Integrate.io Can Help
Need a big data guide through the world of Hadoop? Integrate.io's data integration experts know how to help you optimize your big data workflows. Our cloud-based ETL (extract, transform, load) solution delivers simple, visualized data pipelines for automated data flows across a huge range of sources and destinations.
Ready to begin? Contact the Integrate.io team to schedule a demo and risk-free trial, and start experiencing the benefits of the Integrate.io platform for yourself.