With technology innovations raging at incredible speeds over the past few decades, new and exciting platforms for gathering, storing, transforming, and manipulating data are entering the market every day. Apache Hadoop was one of these disrupters when it entered the market in 2006, offering distributed storage and big data processing using a network of many computers. This article reviews the background of Apache Hadoop, the technology stack and architecture, and the common use cases for this forward-thinking platform.
- What is Apache Hadoop?
- Apache Hadoop Technology Stack
- Managing ETL in Hadoop with Integrate.io
- Final Thoughts
What is Apache Hadoop?
Apache Hadoop is a platform built on the assumption that hardware failure is the norm rather than the exception. The project grew out of a Google File System paper published in October 2003 and evolved over the next few years, eventually taking its name from a toy elephant that belonged to the son of one of its founders.
Released in April 2006 as version 0.1.0, Hadoop's core technology is a distributed file system and processing framework that spreads large blocks of files across the nodes in a cluster and then processes the data in parallel. With hundreds or thousands of nodes providing local computation and storage, the failure of any single node has little effect on the overall system. Hadoop expects that some component will always be failing, and it is designed to redistribute the load and recover quickly.
The result of this complex engineering is an architecture that delivers high availability and fast processing speeds for large data stores that can include both structured and unstructured data. It also delivers high throughput and, as it is an open-source project, developers can adjust the Java code to suit their requirements. Hadoop is compatible with all major platforms and can host applications of any size, especially those requiring fast and furious data flow.
Read about how Hadoop can also process small data files.
Apache Hadoop Technology Stack
The tech stack of Hadoop comprises five primary modules supplemented with a wide range of additional tools, frameworks, and related projects. Because it is open-source, anyone can use the code to build their own project, so the platforms and frameworks available to Hadoop enthusiasts continue to grow. The five Hadoop modules are:
Hadoop Common
Techies also refer to Hadoop Common as Hadoop Core because it provides the libraries and utilities that support the entire system. It contains the filesystem- and OS-level abstractions, Java Archive (JAR) files, scripts, source code, and documentation required for the rest of the stack to function. Hadoop Common is what enables users to take advantage of the Hadoop architecture, including the tasks below (a short configuration sketch follows the list):
- Starting the CLI MiniCluster
- Configuring and managing the Fair Call Queue
- Using the Native Hadoop Library
- Setting up Proxy Users
- Configuring authentication for Hadoop in a secure mode
- Setting up Service Level Authorization
- Configuring Hadoop HTTP web consoles to require user authentication
- Modifying the default behavior of the shell
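As a small illustration of the kind of plumbing Hadoop Common provides, the sketch below uses its `org.apache.hadoop.conf.Configuration` class to read and override a few core settings programmatically. The property names are standard Hadoop settings, but the user, group, and host values are placeholders; a real cluster would normally define them in core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class CommonConfigSketch {
    public static void main(String[] args) {
        // Configuration loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Read the default file system URI, e.g. hdfs://namenode:8020.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        // Example overrides; the values below are placeholders, not recommendations.
        conf.set("hadoop.security.authentication", "kerberos");              // secure mode
        conf.set("hadoop.proxyuser.etluser.hosts", "etl-host.example.com");  // proxy user host
        conf.set("hadoop.proxyuser.etluser.groups", "etl-group");            // proxy user groups

        System.out.println("Authentication mode: " + conf.get("hadoop.security.authentication"));
    }
}
```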
Hadoop Distributed File System (HDFS)
The primary storage layer of Hadoop, HDFS is a highly configurable distributed file system that breaks files into blocks and distributes them across machines, replicating each block across three nodes by default. The default setup is adequate for most installations, but users can tune it for large clusters. HDFS uses a master/slave architecture with the following components (a short client sketch follows the list):
- A cluster comprises a single NameNode, which acts as the master server, and many DataNodes
- The NameNode runs on the master machine
- DataNodes run on the worker (slave) machines, usually one DataNode per machine
- Files of business data are split into blocks and stored across a set of DataNodes
- The NameNode instructs the DataNodes to create, delete, or replicate blocks
- The DataNodes then execute those commands
- Every three seconds, each DataNode sends a heartbeat to the NameNode to report its health
- Machines running the NameNode and DataNodes usually run Linux
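To make the block model concrete, here is a minimal sketch using the standard `org.apache.hadoop.fs.FileSystem` API. It writes a file to HDFS and then asks the NameNode which DataNodes hold each block; the NameNode URI and file path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode URI
        conf.set("dfs.replication", "3");                             // the default replication factor

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");                    // placeholder path

        // Write a small file; HDFS splits larger files into blocks automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hadoop");
        }

        // Ask the NameNode where the blocks of this file live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```

With the default replication factor of three, each block should be reported on three different DataNodes.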
Hadoop YARN
YARN is an interesting acronym; it stands for Yet Another Resource Negotiator. Its basic function is to split resource management and job scheduling/monitoring into separate services. YARN has several interlocking components:
Two separate daemons: ResourceManager and ApplicationMaster
- ResourceManager: teams up with the NodeManagers to form the data-computation framework
- ResourceManager: arbitrates resources among all the applications running on the system
- NodeManager: reports resource usage to the ResourceManager and oversees the containers running on its machine
- ApplicationMaster: a per-application broker between the ResourceManager and the NodeManagers that negotiates resources and works with the NodeManagers to execute and monitor tasks
ResourceManager components: Scheduler and ApplicationsManager
- Scheduler: responsible for allocating resources to running applications based on capacity, queue, and other constraints
- Includes pluggable policies such as the CapacityScheduler and the FairScheduler
- ApplicationsManager: responsible for accepting job submissions, negotiating the first container for each application's ApplicationMaster, and restarting the ApplicationMaster container after a failure
YARN Federation can scale YARN beyond a few thousand nodes if needed.
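To show how these pieces surface to client applications, here is a minimal sketch that uses the `org.apache.hadoop.yarn.client.api.YarnClient` API to connect to the ResourceManager and list the cluster's NodeManagers and running applications. The ResourceManager hostname is a placeholder.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        conf.set("yarn.resourcemanager.hostname", "rm.example.com"); // placeholder host

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Each running NodeManager reports in as a node of the cluster.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println("Node " + node.getNodeId() + " capacity: " + node.getCapability());
        }

        // Applications currently known to the ResourceManager.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " (" + app.getName() + "): "
                    + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```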
Hadoop MapReduce
The main purpose of MapReduce is to provide a framework for writing applications that process large amounts of data in parallel on large clusters. MapReduce and HDFS typically run on the same set of nodes, which lets the framework schedule tasks on the nodes where the data already resides. The MapReduce framework includes:
- A single master ResourceManager
- One worker NodeManager per cluster node
- One MRAppMaster per application
The process works as follows (the word-count sketch below illustrates these steps):
- A job splits the input data into independent chunks.
- Map tasks process the chunks in parallel.
- The framework sorts the map outputs and passes them to the reduce tasks.
- Both the job's input and output are typically stored in HDFS.
- The framework schedules tasks, monitors them, and re-executes any that fail.
Even though Hadoop is written in Java, MapReduce applications do not need to use Java. Additional tools for MapReduce include Hadoop Streaming (for running jobs with any executable) and Hadoop Pipes (a C++ API).
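The canonical MapReduce example is word count: the map tasks emit each word with a count of 1, and the reduce tasks sum the counts per word. The sketch below is a lightly commented version of the classic WordCount job from the Hadoop MapReduce tutorial; the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, the job is submitted with the `hadoop jar` command, and YARN schedules the map and reduce tasks across the cluster.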
Read about the differences between Hadoop MapReduce and Apache Spark.
Hadoop Ozone
Hadoop Ozone began as an object store for Hadoop, but it has since spun out as its own Apache project and is now "Apache Ozone." It can scale to billions of objects and works with container platforms like Kubernetes. As of this writing, it is on release 1.1.0.
Many layers make up Ozone's architecture:
- Metadata Management Layer: consists of the Storage Container Manager (SCM) and the Ozone Manager (OM)
- Data Storage Layer: contains the nodes managed by SCM
- Replication Layer: replicates data from OM and SCM and provides data modification consistency
- Recon, the Management Server: connects the components and manages the API and UX
- Protocol Bus: allows Ozone to be extended with other protocols
The primary components in the Ozone architecture are (a short client sketch follows the list):
- Keys: Ozone stores data as keys
- Buckets: the container in which keys live
- Volumes: a storage container for buckets — a user can create an unlimited number of buckets inside a volume
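To illustrate the volume/bucket/key hierarchy, here is a rough sketch using Ozone's Java client. The volume, bucket, and key names are placeholders, and the exact client method signatures should be treated as assumptions to check against the Ozone release you are running.

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.ObjectStore;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.OzoneVolume;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class OzoneHierarchySketch {
    public static void main(String[] args) throws Exception {
        // Connects to the Ozone Manager configured in ozone-site.xml.
        OzoneConfiguration conf = new OzoneConfiguration();
        OzoneClient client = OzoneClientFactory.getRpcClient(conf);
        ObjectStore store = client.getObjectStore();

        // Volume -> bucket -> key, mirroring the component list above.
        store.createVolume("sales");                   // placeholder volume name
        OzoneVolume volume = store.getVolume("sales");
        volume.createBucket("q1-reports");             // placeholder bucket name
        OzoneBucket bucket = volume.getBucket("q1-reports");

        byte[] data = "hello ozone".getBytes("UTF-8");
        try (OzoneOutputStream out = bucket.createKey("report.csv", data.length)) {
            out.write(data);                           // the key holds the object data
        }

        client.close();
    }
}
```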
Managing ETL in Hadoop with Integrate.io
Companies adopting Hadoop need an ETL strategy that continuously moves data between Hadoop and their applications and systems. Hadoop has its own built-in ETL capability: data managers can create clusters, connect data sources, define the metadata, and then start creating the ETL jobs that populate the workflow. They can do this manually by following the instructions on the Apache website, but it's much easier to let an ETL tool like Integrate.io do the heavy lifting.
Integrate.io is a cloud-based ETL (Extract, Transform, Load) solution that provides simple visualized data pipelines for automated data flows across a wide range of sources and destinations. It integrates beautifully with Hadoop to ease and speed ETL workflows. The platform's powerful transformation tools allow customers to transform, normalize and clean their data in Hadoop while also adhering to compliance best practices.
Final Thoughts
Hardware failure, downtime, and system performance issues are very common for organizations across the globe; Hadoop offers a solution that uses an innovative distributed file system to reduce these concerns. Through the five modules described above, designed with sophisticated engineering that divides data into blocks and distributes the load across low-cost machines, Hadoop brings high availability, quicker processing speeds, and massive data storage capacity to applications of any size.
Hadoop continues to be a leader among open-source systems, offering continuous updates and enhancements from its network of developers. To learn more about how Hadoop and Integrate.io work together to speed ETL, please contact us to discuss our 14-day pilot program.