Processing Big Data on a traditional RDBMS infrastructure is like trying to cook a turkey in a microwave. Even if you somehow manage to stuff it all in, it’s still going to end in disaster.
Big Data is fundamentally different from relational data. To process it, you need an entirely new technological paradigm: the Big Data stack.
- Why Do You Need a Big Data Stack?
- What’s the Architecture of a Big Data Stack?
- 4 Main Challenges of a Big Data Stack
- Getting Value from Your Big Data Stack
Why Do You Need a Big Data Stack?
Big Data is something of a misnomer. It’s not just a matter of size, although Big Data sets are often measured in terabytes or even petabytes. Big Data differs from traditional data in a number of ways, often summarized as the seven V’s: volume, velocity, variety, variability, veracity, visualization, and value.
A traditional data stack supports traditional data. You have some relational databases attached to production systems, you extract and transform this data with ETL, and then you upload it to a data warehouse.
Big Data presents a whole range of new challenges. To succeed, you need an entirely new kind of infrastructure.
What’s the Architecture of a Big Data Stack?
A Big Data stack is different from a traditional stack on almost every level, from hardware right up to the analytics tools. Let’s look at each layer in turn, starting from the bottom of the stack and moving up.
Hardware Layer
Big Data architecture uses the concept of clusters: small groups of machines that have a certain amount of processing and storage power. When you need to increase capacity within your Big Data stack, you simply add more clusters – scale out, rather than scale up.
Clustering usually involves a high rate of redundancy, so that if one cluster is busy or unavailable, the task moves to another cluster. This delivers high reliability and low latency, even though the clusters themselves tend to run on low-cost commodity hardware.
Data Layer
The hardware layer calls for cluster-oriented storage, such as Apache Hadoop’s HDFS, or cloud-based object storage like Amazon S3. Data processing frameworks like MapReduce then run across the cluster to answer queries quickly.
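To make the MapReduce idea concrete, here’s a minimal, single-machine sketch in plain Python: the map step emits key-value pairs and the reduce step aggregates them per key. A framework like Hadoop or Spark does the same thing, but distributes the map and reduce work across the cluster.

```python
from collections import defaultdict

def map_step(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def reduce_step(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data needs big clusters", "clusters scale out"]

# On a real cluster the map step runs in parallel on each node;
# here we simply chain the two steps locally.
all_pairs = (pair for doc in documents for pair in map_step(doc))
print(reduce_step(all_pairs))
# {'big': 2, 'data': 1, 'needs': 1, 'clusters': 2, 'scale': 1, 'out': 1}
```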
In terms of data organization, several solutions often sit side by side. Relational databases support structured data. Columnar databases store data by column rather than by row, which makes them well suited to analytical queries over large data sets. NoSQL databases help organize semi-structured and unstructured data.
Ingestion Layer
Getting data into the Big Data stack requires a new ingestion approach. Most commonly, this involves ELT (Extract, Load, Transform). ELT is similar to ETL, except that there’s no transformation layer between source and destination. This speeds up data transfer and allows you to process any kind of information.
ETLT is a compromise approach in which you apply some initial transformations before ingestion. You can use this approach to cleanse and tag data, which will then speed up future transformation.
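As a rough illustration, here’s a hedged Python sketch of this ingestion flow, including the optional light cleansing and tagging step that turns plain ELT into ETLT. The source function, bucket, and key are placeholders for your own systems, not real Integrate.io APIs.

```python
import json

import boto3  # assumes AWS credentials are configured in the environment

def fetch_source_records():
    """Placeholder extract step: in practice, pull from an API, database, or log."""
    return [{"user_id": 1, "email": " ALICE@EXAMPLE.COM "},
            {"user_id": 2, "email": "bob@example.com"}]

def light_cleanse(record):
    """Optional ETLT step: minimal cleansing and tagging before loading."""
    record["email"] = record["email"].strip().lower()
    record["ingest_tag"] = "users-demo"
    return record

def load_to_lake(records, bucket="my-data-lake", key="raw/users/2024-01-01.json"):
    """Load step: write the records to the lake as one raw JSON object."""
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(records).encode("utf-8"))

records = [light_cleanse(r) for r in fetch_source_records()]
load_to_lake(records)  # heavier transformations happen later, at read time
```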
Application Layer
When users interact with a Big Data stack, they’ll generally do so with a Business Intelligence or analytics tool. These tools use a number of sorting and statistical techniques to find the most relevant trends in data, even if those trends are deep within a data lake.
Some third-party tools can connect directly to the Big Data repository. For security and functionality reasons, however, you may prefer to connect these tools to the ELT platform. For example, Integrate.io integrates directly with tools like Chartio and Dundas BI.
4 Main Challenges of a Big Data Stack
Working with a new kind of stack can mean new kinds of problems. They're all solvable if you have the right approach.
Integration
The problem: Big Data means working with large volumes of structured, semi-structured, and unstructured data from disparate sources. How can you bring all of these elements together in a single repository, facilitating detailed analytics?
The solution: A data lake is a central repository that stores all of your data in one place. Lakes are typically built on distributed file systems like Hadoop HDFS or on cloud object storage like Amazon S3 or Microsoft Azure Blob Storage. The lake holds all of your data in its original format, regardless of structure or compatibility with other data.
To fill the lake, you’ll need an ELT (Extract, Load, Transform) process. This is similar to ETL (Extract, Transform, Load), except that there is no intermediate transformation layer.
Instead, ELT connects directly to your data sources and extracts what you need. It then loads this data directly to the data lake, and stores it as a document or blob. This approach is inherently faster than ETL, as the data doesn’t pass through a transformation layer.
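A minimal sketch of that raw load step, assuming an S3-backed lake; the file path, bucket, and key are hypothetical. The point is that the bytes land in the lake untouched, in whatever format the source produced.

```python
import boto3  # assumes AWS credentials are configured in the environment

def load_raw_blob(local_path, bucket, key):
    """Pure ELT load: copy the source file into the lake byte-for-byte,
    with no transformation layer in between."""
    s3 = boto3.client("s3")
    with open(local_path, "rb") as source_file:
        s3.put_object(Bucket=bucket, Key=key, Body=source_file.read())

# Hypothetical source export and lake location; CSV, JSON, logs, or images all work.
load_raw_blob("exports/orders_2024-01-01.csv",
              bucket="my-data-lake",
              key="raw/orders/orders_2024-01-01.csv")
```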
Combination
The problem: Without any organizing principles, a data lake is just a form of file storage. How do you connect all of these elements so that you can search and analyze the lake’s contents?
The solution: NoSQL (also known as “Not Only SQL”) is an organizational structure for Big Data. Unlike SQL, there’s no standard implementation of NoSQL. Instead, you’ll find a variety of different flavors, each suited to a different use case. The main versions are:
- Key-value: A fast, lightweight system that maps unique keys to values, best suited to simple, rapid lookups. Example: Apache ZooKeeper.
- Document store: A key-value database where each value is a document, typically a JSON file. Example: MongoDB.
- Wide-column: Similar to a relational database, except that rows don’t all have to share the same column structure. Example: Apache Cassandra.
- Graph database: Uses a system of nodes and edges to describe relationships between values. Example: Neo4j.
NoSQL systems allow you to efficiently query the contents of your data lake, without having to organize those contents beforehand. An automated ELT process can populate a NoSQL system according to your requirements.
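For instance, with a document store such as MongoDB, you can load and query records without first defining a rigid schema. Here’s a minimal sketch using the pymongo driver; the connection string, database, and collection names are placeholders.

```python
from pymongo import MongoClient

# Placeholder connection string; point this at your own MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
events = client["lake_catalog"]["click_events"]

# Documents in the same collection don't need identical structures.
events.insert_many([
    {"user_id": 1, "page": "/pricing", "referrer": "google"},
    {"user_id": 2, "page": "/docs"},  # no referrer field, and that's fine
])

# Query on whichever fields happen to exist.
for doc in events.find({"page": "/pricing"}):
    print(doc["user_id"], doc["page"])
```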
Transformation
The problem: Relational database structures like data warehouses are schema-on-write. Data transformations happen prior to ingestion, so it’s easy to query at any time. How does this work with Big Data?
The solution: Data lakes are schema-on-read. Transformations, such as validation, cleansing, and integration, only happen when you need them.
If the data lake is well managed, this isn’t a problem. NoSQL already applies some structure to your data. Tools like MapReduce and Spark can churn through terabytes of data at astonishing speed. In practice, schema-on-read can perform as fast as schema-on-write, if not faster.
Performance is a function of data quality, though. If your lake isn’t supplied by an automated ELT platform and managed with strong data governance policies, your queries will get slower and less accurate over time.
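Here’s a hedged PySpark sketch of schema-on-read: the raw JSON sits in the lake untransformed, and the validation and cleansing are applied only at read time. The lake path and column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Read raw JSON straight from the lake; Spark infers a schema at read time.
raw = spark.read.json("s3a://my-data-lake/raw/users/")  # placeholder; a local path works too

# Transformations (validation, cleansing, integration) happen now, not at ingestion.
cleaned = (raw
           .filter(F.col("email").isNotNull())
           .withColumn("email", F.lower(F.trim(F.col("email"))))
           .select("user_id", "email"))

cleaned.show()
```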
Analytics and BI
The problem: Many BI tools only work with tidy, structured data. Big Data, by contrast, is vast and disorderly. Where do you even start when looking for insights?
The solution: You can perform manual analysis on structured data using basic tools. Even something like Excel will allow you to create detailed visualizations. But with Big Data, only a data scientist can perform a manual analysis.
That’s why very few people rely on manual techniques these days. Analytics and BI tools handle the important jobs, such as the following (a small statistical-analysis sketch appears after the list):
- Data exploration: An initial investigation to flag up any potentially interesting patterns within Big Data.
- Statistical analysis: Apply techniques like regression modeling and cluster analysis to derive actionable insights.
- Visualization: Turn analytics insights into visual information, such as charts and graphs.
- Real-time analytics: Create a live dashboard that shows the organization’s current state.
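As an illustration of the statistical-analysis step above, this sketch clusters customers by two behavioral features using scikit-learn. The numbers are made-up sample data, not real results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up sample features per customer: [monthly_orders, avg_order_value]
customers = np.array([
    [2, 15.0], [3, 18.5], [1, 12.0],      # low-activity shoppers
    [20, 95.0], [25, 110.0], [22, 87.5],  # high-value shoppers
])

# Group the customers into two behavioral clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(customers)

for customer, label in zip(customers, labels):
    print(f"orders={customer[0]:.0f}, avg_value={customer[1]:.1f} -> cluster {label}")
```

In practice, a BI platform runs this kind of analysis for you and layers visualization and live dashboards on top, but the underlying techniques are the same.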
Analytics and BI tools can connect directly to your ELT platform. This is often safer and faster than linking directly to your Big Data repository.
Getting Value from Your Big Data Stack
Of all the layers in a Big Data stack, the most important is probably the ingestion layer. Your data pipeline determines how quickly data arrives and whether it arrives at the quality you need. If the ingestion layer isn’t right, you’ll end up with an empty lake, or worse, a data swamp.
Data pipelines are also a crucial security measure. Automated ELT platforms encrypt all traffic from end to end, and data transactions aren’t logged on the platform, which greatly reduces the risk of data leaking.
Want to discover how Integrate.io's ELT can power your Big Data stack? Book a demo and see for yourself.