Introducing Apache Hadoop to an organization can be difficult. Your workforce is trained and experienced in SQL and the analytics tools that integrate with it. However, when you reach a point where the data is too big for a MySQL server, stepping up into the world of Big Data becomes necessary.
One of the key elements of Big Data is data variety: you gain the ability to store structured and unstructured data without a predefined schema. SQL-based technologies only handle structured data that fits a predefined schema. While you can use extensions bundled with many SQL databases to store semi-structured data such as JSON, XML, or BLOBs, you’re still limited in the data types you can collect, and you add maintenance overhead to keep the schema up to date. Big Data supports Agile methodologies and unpredictable schemas, so you eliminate those added maintenance tasks. Hadoop is a common choice for organizations moving to Big Data.
What is Hadoop?
Hadoop is an open-source framework that supports your Big Data efforts through distributed processing. The Apache Software Foundation manages this open-source project, and it was originally developed by Doug Cutting, Mike Cafarella, and Yahoo!.
You’re able to store any data type, leverage massively parallel processing, and begin getting value out of your ever-increasing collection of data. The framework is built for scalability on a distributed computing model, so it can handle your current and future workloads on commodity hardware. Hadoop uses the Hadoop Distributed File System, or HDFS, for storage and the MapReduce programming model to process structured, semi-structured, and unstructured data at scale.
The four components of this framework are:
- HDFS
- Hadoop Common
- MapReduce
- Yet Another Resource Negotiator (YARN)
For more information on Integrate.io's native Hadoop HDFS connector, visit our Integration page.
The Hadoop framework delivers many benefits to your organization’s data strategy, including fault tolerance, decreasing your data storage costs, and the ability to store as much structured and unstructured data as you need. However, it does have some distinct drawbacks, especially when you’re starting your transition.
MapReduce programming is not as efficient as it could be for advanced analytics needs. Each job writes its intermediate results to disk and passes through shuffle and sort phases, so iterative algorithms that chain many jobs together become file-intensive and slow.
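To make the shape of that process concrete, here is a compact sketch of the canonical word-count job written against Hadoop’s Java MapReduce API. The mapper emits intermediate key-value pairs that are written out, shuffled, and sorted before the reducer aggregates them; an iterative algorithm has to repeat that whole disk-bound cycle for every pass over the data.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: runs after the framework has shuffled and sorted the mapper output.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```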
You also have to consider whether your organization has the technical talent needed to support Hadoop. Specialists in this area are in demand, and the candidate pool is smaller than it is for SQL. You may struggle to staff your Hadoop projects, but there’s an option that leverages existing skill sets and sets up more efficient analytics while still delivering the Big Data power of this framework: SQL-on-Hadoop solutions.
How Hadoop Supports Big Data and Data Scalability
When you start looking at using data lakes and data warehouses for your data stores in your organization, you need a way to work with these large data sets. The Hadoop ecosystem has many capabilities that make it a strong foundation for creating scalable analysis processes and operating in a Big Data environment:
- Support for clusters of up to 4,500 connected machines, including commodity hardware
- A DataNode for storage and a TaskTracker (or NodeManager under YARN) on every worker machine
- Automatic data replication for high availability and fault tolerance handled at the application layer
- Steady throughput on massive quantities of data through scalable parallel processing
- Horizontal and vertical scaling
What is SQL-on-Hadoop?
As the name implies, SQL-on-Hadoop refers to tools that let you run SQL-like queries on structured data while leveraging Hadoop for your analytics requirements. You reduce the number of Hadoop specialists you need to recruit and get your organization up and running on this framework faster.
Each solution has its own approach for executing SQL queries in Hadoop, but they typically use one or more of the following methods:
- SQL to MapReduce translation connectors
- Pushing SQL down to the Hadoop cluster
- Using both MapReduce-HDFS and raw HDFS clusters for SQL query execution
How MapReduce Works in Apache Hadoop
Here’s an example of how one SQL-on-Hadoop tool, Apache Hive, functions. It uses HQL, an SQL-like language that is translated into MapReduce jobs on Hadoop. It lets your team run jobs without learning a new programming language, and it integrates with existing SQL-based tools.
However, the MapReduce jobs run as batch data processing, which means results can take much longer to return than you expect. SQL users are accustomed to interactive queries that return in seconds to minutes, so they can get frustrated by the experience.
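As a rough sketch of how existing SQL skills carry over, a Java application can submit HQL through Hive’s standard JDBC driver. The HiveServer2 host, port, credentials, and the sales table below are placeholder assumptions for illustration; note that the call blocks while the underlying batch jobs run.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Hive's JDBC driver; the hive-jdbc JAR must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Placeholder HiveServer2 endpoint and credentials.
    String url = "jdbc:hive2://hive-server.example.com:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // An ordinary-looking aggregate query; Hive compiles it into batch jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT category, COUNT(*) AS orders FROM sales GROUP BY category")) {
      while (rs.next()) {
        System.out.println(rs.getString("category") + "\t" + rs.getLong("orders"));
      }
    }
  }
}
```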
Check out our guide to 12 SQL-on-Hadoop tools to learn more about the specific solutions available.
How Does HDFS Work?
Other SQL-on-Hadoop technologies include Hadapt, Impala, Presto, and Jethro, many of which support interactive querying on Hadoop. These tools help, but they still run on top of HDFS, which works very differently from the relational tables of SQL databases. HDFS cuts files into blocks of data and spreads them over a cluster of machines for storage. It keeps duplicate copies of the data and maintains integrity automatically. Because of this architecture, you may encounter the following common SQL-on-Hadoop challenges.
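As a small illustration of that storage model, the sketch below uses Hadoop’s Java FileSystem API to report the block size, replication factor, and block locations of a single file. The file path is a placeholder, and the example assumes your cluster’s core-site.xml and hdfs-site.xml are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockReport {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS and related settings from the cluster config on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/events/part-00000"); // placeholder HDFS path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size: " + status.getBlockSize()
        + " bytes, replication factor: " + status.getReplication());

    // Each block of the file is stored on several DataNodes.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Block at offset " + block.getOffset()
          + " lives on: " + String.join(", ", block.getHosts()));
    }
  }
}
```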
Common SQL-on-Hadoop Challenges
1) Unsupported File Formats
While Hadoop can store files in any format, SQL-on-Hadoop technologies require data in specific formats in order to process it. You can run into a situation where the solution doesn’t support storing and querying all of your organization’s data, which can arrive in a variety of formats or no format at all.
Jethro and other tools require using their own file structure. Impala is fully compatible with text files and Parquet, a columnar storage format for Hadoop, while providing partial support for other formats. Presto works with Hadoop file formats such as text, RCFile, and SequenceFile. Hive supports implementing a custom serializer/deserializer (SerDe) that can read and write any file format, but you have to invest in programming to set that option up.
2) Increasing Server Maintenance Complexity
Some SQL-on-Hadoop technologies, such as CitusDB, Hadapt, and BigSQL, require each node in the Hadoop cluster to have PostgreSQL installed. As your clusters scale in size and quantity, deploying and maintaining PostgreSQL becomes cumbersome. You may need to scale up your maintenance team to accommodate these requirements, which results in more costly recruitment and staffing expenses.
The more complex your server maintenance is, the more likely that something could get missed. An unpatched database is an open invitation for hackers, and data breaches are resource-intensive and costly.
3) Increasing Schema Maintenance Requirements
One of Hadoop’s selling points is its lack of schema. However, if you want to use SQL on Hadoop, you need to define and manage a schema, and if new data comes in that doesn’t fit that schema, you run into issues. Hadapt claims it doesn’t require schema definition for self-descriptive formats such as JSON or XML, but standard SQL databases already have this ability, so it isn’t bringing anything new to the table.
4) Lack of ACID Compliance
SQL databases support Atomicity, Consistency, Isolation, Durability, or ACID, to guarantee reliable database transactions. Hadoop doesn’t offer ACID compliance, so you need a third-party tool to provide it. Hive is one SQL-on-Hadoop tool that offers ACID compliance as an option, and other options have capabilities that are similar to one or more aspects of ACID.
5) Lack of OLTP Support
Hadoop is based on sequential reads and doesn’t support updates, so it’s more useful for Online Analytical Processing, or OLAP, by definition. Hive, an SQL-on-Hadoop tool based on MapReduce, does not support Online Transaction Processing, or OLTP, since MapReduce isn’t designed for single-row operations. Other tools aren’t based on MapReduce, but they still target analytical queries. HBase provides transactional functionality, but it isn’t ACID compliant, so you can’t use it to guarantee reliable database transactions.
6) Unable to Access Advanced SQL Functionality Out-of-the-box
SQL offers many features that go beyond queries, such as views, stored procedures, and user-defined functions. Most SQL-on-Hadoop tools don’t support these advanced capabilities, and you need to commit development resources to solve these problems. For example, you’d need to write Java code for Hive and C++ for Impala.
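As an illustration of the kind of development work involved, the sketch below defines a hypothetical email-masking function for Hive using the older org.apache.hadoop.hive.ql.exec.UDF base class (newer Hive versions favor GenericUDF). After packaging it as a JAR, you would register it in Hive with CREATE TEMPORARY FUNCTION before using it like a built-in.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: masks the local part of an email address,
// e.g. "alice@example.com" becomes "a***@example.com".
public final class MaskEmail extends UDF {
  public Text evaluate(Text email) {
    if (email == null) {
      return null; // Hive passes NULL through
    }
    String value = email.toString();
    int at = value.indexOf('@');
    if (at <= 1) {
      return new Text("***"); // nothing meaningful to preserve
    }
    return new Text(value.charAt(0) + "***" + value.substring(at));
  }
}
```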
7) Lack of Support for Update Statements
HDFS doesn’t support update statements, unlike SQL. Some SQL-on-Hadoop tools may implement it, but it isn’t clear exactly how they accomplish that, as it requires random read/write access to all data on Hadoop. Hadoop doesn’t provide that feature. It’s possible they implement it like HBase, which uses in-memory indexes and compacts files occasionally to remove older versions.
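For comparison, here’s roughly what a row-level update looks like through HBase’s Java client: a Put simply writes a new cell version, and compaction later removes old versions. The table, column family, and row key below are placeholder assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUpdateExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster.
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("customers"))) {
      // "Updating" a value writes a new cell version keyed by timestamp;
      // older versions are dropped later during compaction.
      Put put = new Put(Bytes.toBytes("customer-42"));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
          Bytes.toBytes("new-address@example.com"));
      table.put(put);
    }
  }
}
```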
8) Lack of Manual Joins and Dimensions
HDFS automatically decides how to spread blocks of data over the cluster, so you can’t manually control data placement for joins and dimensions. In certain cases, this works against your SQL-on-Hadoop goals. Keeping related data together on the same node, or replicating small dimension data such as product names, categories, or clients to every node, can make joins much more efficient. Without that manual control, it takes more time to pull the relevant data together from across the network.
9) Slower Querying Speed with Disk-Based Query Engines
Some SQL-on-Hadoop solutions use disk-based query engines, which slows down the querying process compared to an in-memory engine. Even a small delay in processing speed makes a massive difference when you’re working with large data sets in real-time for big data analytics. Caching is one way these solutions try to address this problem, but the query optimizer needs to go through a cache validation process rather than automatically understanding that the necessary data is in memory.
Some organizations use a different open-source data processing framework called Apache Spark to implement in-memory operations. This approach works best for smaller data sets that can fit in the available memory, and includes libraries that cover machine learning and SQL capabilities. It also offers APIs for R, Java, and others.
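As a sketch of that in-memory approach, a minimal Spark SQL job in Java can cache a dataset and query it with familiar SQL. The Parquet path and column names are assumptions for the example.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkInMemoryQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("CachedSqlExample")
        .getOrCreate();

    // Placeholder path; Parquet files on HDFS are a common source.
    Dataset<Row> sales = spark.read().parquet("hdfs:///data/sales");
    sales.createOrReplaceTempView("sales");
    sales.cache(); // keep the dataset in memory across repeated queries

    // Subsequent SQL over the cached view avoids re-reading from disk.
    spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category")
        .show();

    spark.stop();
  }
}
```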
10) Failing to Find the Right Combination of Hadoop Distribution and SQL-on-Hadoop Solution
Hadoop is open-source software, so many different distributions are available. If you choose a Hadoop and SQL-on-Hadoop combination that doesn’t work well for your use cases, you may struggle to realize the benefits of the technology. Many vendors will encourage you to use their recommended distribution, but that isn’t necessarily the best one for your requirements, and testing different distributions and SQL-on-Hadoop tools is a time- and resource-consuming process.
SQL-on-Hadoop or ETL: Which is the Right Option for Your Analytics?
SQL-on-Hadoop solutions are not the only way to work more efficiently with your Big Data operations. An Extract, Transform, Load, or ETL, tool such as Integrate.io can power your data pipeline and get you access to your valuable data faster. This solution works by connecting your data sources, preparing the data through cleansing, masking, formatting, and other transformations, and loading it into your data stores. This automated process lets you spend more time on analyzing your clean, secure data and less time worrying about how it gets from source to storage.
Try out a fourteen-day demo to discover the possibilities Integrate.io has to offer.