In order to understand popular data warehouses like Amazon Redshift, you first need to understand the underlying architecture and core principles upon which they are built. That architecture is Massively Parallel Processing (or MPP for short). In this guide, we’ll dive into what an MPP database is, how it works, and the strengths and weaknesses of Massively Parallel Processing. If you have any questions, let us know in the comments!
While storage and computing power have come a long way in the last several decades, the unfortunate reality is that no single machine has kept up with modern data storage and analysis needs. MPP databases solve this problem by distributing data and processing power across several nodes, so that large datasets can be analyzed efficiently.
Big Data Analysis: A Human Example
Scale Horizontally, Not Vertically
To see how an MPP architecture makes processing large datasets more effective, let’s step away from the world of computers for a minute and see how we might solve a similar problem with people instead of servers. Let’s pretend that you are a researcher whose lifelong dream is to count the total number of words in the Library of Congress. After registering in person in Washington, D.C. (amazingly, that’s all that’s required), you’re granted access, grab the first book you see off the shelves, and start counting.
In this example, you represent one server going through the data (the Library of Congress) and processing the query (the SUM of words), and it’s easy to see the problem with this: according to Wikipedia, the Library of Congress contains over 167 million unique items. At a normal reading speed, it would take you tens of thousands of years to get through every text.
Recognizing this, you attempt to increase your reading speed (scaling vertically). You invest in speed-reading classes and buy some expensive tools to help you open books and flip through pages faster. This doubles the number of words you can count in a minute, but even at this increased rate, you won’t finish the task for thousands of years.
What if, instead of investing in your own processing power, you asked the 3,000 people the Library of Congress employs for help? While you can only get so much faster at counting words, your ability to scale horizontally by adding more people is almost unlimited. The Head Librarian decides this project would be a great use of taxpayer money, dedicates every employee to the task, and puts you in charge of everyone.
Think about this situation for a minute: What makes the most sense to do? How would you organize everyone to make sure you got results quickly, efficiently, and accurately?
The easiest thing to do would probably be to assign each person a roughly equal portion of the library to tackle, say...everyone gets a shelf. It won’t be exactly equal, but it will be close enough. While everyone reads through the texts on their assigned shelf and counts up the words, you, as the leader, organize everything: you make sure that everyone has a shelf assigned, that they get snacks, water, and bathroom breaks as needed, and you find some temp workers to fill in when a few people call in sick.
When an employee is done counting all the words on their shelf, they write the number down on a sticky note, walk over to your desk, and hand it to you. You then add the number on the sticky note to your running tally of total words. When everyone is done with their shelves, you have your final tally!
This is Massively Parallel Processing in action, just with humans instead of computers. Splitting up simple but large tasks into multiple buckets and getting those buckets processed at the same time will be much faster than one person working alone, no matter how skilled that person is.
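To make the analogy concrete, here is a minimal sketch of the same shelves-and-sticky-notes routine in Python. The shelves are just hard-coded lists of strings, and a process pool stands in for the library staff; it’s illustrative rather than production code.

```python
from concurrent.futures import ProcessPoolExecutor

def count_shelf(shelf):
    """One 'librarian': count every word on a single shelf of books."""
    return sum(len(book.split()) for book in shelf)

def count_library(shelves):
    """The 'leader': hand one shelf to each worker, then total the sticky notes."""
    with ProcessPoolExecutor() as pool:
        partial_counts = pool.map(count_shelf, shelves)  # shelves are counted in parallel
    return sum(partial_counts)

if __name__ == "__main__":
    # Toy data: four "shelves", each holding a few very short "books".
    shelves = [
        ["the quick brown fox", "jumps over the lazy dog"],
        ["call me ishmael", "it was the best of times"],
        ["to be or not to be"],
        ["in the beginning"],
    ]
    print(count_library(shelves))  # 27
```

Each worker returns one partial count (a sticky note), and the final answer is just the sum of the partials - the same pattern an MPP database uses, with servers in place of people.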
What is an MPP Database?
Clusters and nodes, oh my!
Simply put, an MPP database is a type of database or data warehouse where the data and processing power are split up among several different nodes (servers), with one leader node and one or many compute nodes. In MPP terms, you, the organizer, would be called the leader node - you’re telling all the other people what to do and compiling the final tally. The library employees, your helpers, would be called compute nodes - they’re dealing with all the data, running the queries and counting up the words. MPP databases can scale horizontally by adding more compute resources (nodes), rather than having to upgrade to more and more expensive individual servers (scaling vertically). Adding more nodes to a cluster allows the data and processing to be spread across more machines, which means the query completes sooner.
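As a rough illustration of the leader/compute split, the sketch below uses in-memory SQLite databases as stand-ins for compute nodes. The books table, its columns, and the round-robin distribution are all invented for the example; the point is that the leader turns one big SUM into per-node partial sums and then combines them.

```python
import sqlite3

NUM_NODES = 3

# Stand-in compute nodes: each in-memory SQLite database holds one slice of the table.
nodes = [sqlite3.connect(":memory:") for _ in range(NUM_NODES)]
for node in nodes:
    node.execute("CREATE TABLE books (title TEXT, word_count INTEGER)")

# The leader node distributes rows across compute nodes (round-robin, for simplicity).
books = [("Book A", 80000), ("Book B", 95000), ("Book C", 60000),
         ("Book D", 120000), ("Book E", 70000)]
for i, row in enumerate(books):
    nodes[i % NUM_NODES].execute("INSERT INTO books VALUES (?, ?)", row)

# To answer SELECT SUM(word_count) FROM books, the leader asks every compute node
# for a partial sum over its own slice, then combines the partial results.
partials = [node.execute("SELECT SUM(word_count) FROM books").fetchone()[0] or 0
            for node in nodes]
print(sum(partials))  # 425000
```

Because each node only scans its own slice, adding a fourth node would shrink every slice and, all else equal, shorten the scan.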
Without this structure, running even the simplest of queries on a large dataset would take a prohibitively long time.
MPP vs. Alternatives
Hadoop mostly
Massively Parallel Processing is not the only technology that facilitates the processing of large volumes of data. We have a full analysis comparing Hadoop Hive and Redshift, which we encourage you to check out.
To summarize, unlike Hadoop, MPP databases use a “share-nothing” architecture: sets of data don’t overlap, and all communication happens over a network. Each node or server contains the data it is responsible for (the shelf of books) and the computing power to analyze that data (the librarian). This makes MPP databases much easier to deploy and maintain. Cloud MPP databases, such as Amazon Redshift, are also cost-effective and support SQL-based business intelligence tools such as Looker and Tableau.
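One way to picture share-nothing is that every row is owned by exactly one node, typically chosen by hashing a distribution key. Here is a tiny, hypothetical sketch of that assignment; the key values and node count are made up for illustration.

```python
import hashlib

NUM_NODES = 4

def owning_node(distribution_key: str) -> int:
    """Map a row to the single node that owns it, via a stable hash of its key."""
    digest = hashlib.md5(distribution_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# Each row lands on exactly one node, and no node ever reads another node's slice.
for customer_id in ["c-1001", "c-1002", "c-1003", "c-1004"]:
    print(customer_id, "is stored on node", owning_node(customer_id))
```

Because the owner of a row is a pure function of its key, any node can work out where data lives without asking the others - there is nothing shared to coordinate.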
A common issue with MPP databases is structuring the data: MPP databases do not support unstructured data, and even structured data, such as that from a MySQL or PostgreSQL database, will require some processing to fit the MPP structure. This is because MPP databases are usually columnar, which allows analytical queries to be processed faster. We recommend you read our guide on columnar databases for more.
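To see why the columnar layout matters, here is a toy comparison of the same three rows stored row-wise and column-wise; the titles and figures are made up. An aggregate over one column only has to touch that column in the columnar layout.

```python
# The same three rows, stored two ways.
row_store = [
    {"title": "Book A", "year": 1851, "word_count": 80000},
    {"title": "Book B", "year": 1925, "word_count": 95000},
    {"title": "Book C", "year": 1960, "word_count": 60000},
]
column_store = {
    "title":      ["Book A", "Book B", "Book C"],
    "year":       [1851, 1925, 1960],
    "word_count": [80000, 95000, 60000],
}

# Row layout: the scan touches every row, even though only one field is needed.
print(sum(row["word_count"] for row in row_store))  # 235000

# Columnar layout: the scan reads one contiguous column and skips everything else,
# which is why analytical aggregates are faster on columnar storage.
print(sum(column_store["word_count"]))              # 235000
```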
For these reasons, setting up your pipeline to load your data into Redshift smoothly and easily can be quite a project. This is especially true if you want your data replicated in near real-time, which is usually the case for tracking important business metrics. This is where Integrate.io comes in. Integrate.io provides continuous, near real-time replication into Redshift from your transactional databases, such as MySQL, PostgreSQL, Amazon Aurora, and more. With an easy, one-time setup, our robust system ensures 100% accuracy with each load. Your data is always up-to-date.