MongoDB is a popular non-relational (a.k.a. NoSQL) database. It is document-oriented and distributed in nature, and it is known for being highly scalable and flexible. In this post, we'll demonstrate how you can use MongoDB in your ETL pipelines with Integrate.io. To start, let's briefly discuss why and when you'd want to use MongoDB over a relational database.
- How MongoDB Differs From Relational Databases
- Benefits of Using Integrate.io with MongoDB
- Using MongoDB in an Integrate.io Pipeline
- Implementing the Pipeline
- Conclusion
How MongoDB Differs From Relational Databases
In a relational database, we model our data in terms of tables, columns, and relationships among them. In MongoDB, the information is instead represented as a collection of documents. These documents are like JSON objects and, by default, do not enforce any structure. This gives a lot of flexibility for data-modeling.
One popular rule of thumb for deciding document structure in MongoDB is "what's used together should be stored together". We can store one-to-many (or many-to-many) relationships in array fields within the same document. This approach of "embedding" makes queries highly performant and avoids the need for expensive JOIN operations. Because documents are self-contained, MongoDB can also offer the nice features of distributed databases, such as sharding, synchronization, and replication, which enable scalability across multiple machines.
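To make the embedding idea concrete, here is a minimal sketch using the PyMongo driver, with a hypothetical customers collection, placeholder connection string, and made-up field names; it illustrates the pattern only and is not part of the Integrate.io pipeline built later in this post.

```python
from pymongo import MongoClient

# Placeholder connection string and made-up field names.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# One-to-many relationship stored by embedding: the customer's orders
# live in an array field inside the customer document itself.
db.customers.insert_one({
    "name": "Jane Doe",
    "email": "jane@example.com",
    "orders": [
        {"id": 29393, "total_price": 100},
        {"id": 29394, "total_price": 250},
    ],
})

# A single read returns the customer together with all of their orders, with no JOIN needed.
customer = db.customers.find_one({"email": "jane@example.com"})
print(customer["orders"])
```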
Of course, this way of representing data requires more storage space, but the benefits may outweigh the cost depending on your use case.
Benefits of Using Integrate.io with MongoDB
There are several reasons why a business would choose to use Integrate.io with MongoDB.
In the next section, we'll see in detail how you can use MongoDB in your Integrate.io pipelines.
Using MongoDB in an Integrate.io Pipeline
To use any database in Integrate.io, there are two steps involved: setting up a database connection and adding the database component to the pipeline.
Setting up a MongoDB Connection
To use MongoDB in any of the Integrate.io pipelines, you'll first add a MongoDB connection in the Integrate.io dashboard. The MongoDB connection in Integrate.io supports SSL, which means that data transmission in and out of MongoDB happens over a secured protocol.
Like most database services, MongoDB can be set up in two ways: managed or self-hosted. The self-hosted version involves installing the MongoDB binaries yourself on a machine, whereas the managed version gives you an out-of-the-box setup that lets you connect to the database right away. The managed offering provided by the MongoDB team is called MongoDB Atlas. It comes with some nice features like auto-scaling, monitoring, alerting, backups, and integration with other tools like Data Lake.
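From a client's point of view, the difference is mostly in the connection string. Here is a small PyMongo sketch with placeholder hosts and credentials: a self-hosted server is addressed directly, while Atlas hands you an SRV connection string (TLS is enabled by default on Atlas).

```python
from pymongo import MongoClient

# Self-hosted: point the client at the machine where you installed the MongoDB binaries.
self_hosted = MongoClient("mongodb://db_user:secret@my-own-server:27017/mydb")

# Managed (MongoDB Atlas): Atlas gives you an SRV connection string, with TLS on by default.
managed = MongoClient("mongodb+srv://db_user:secret@cluster0.example.mongodb.net/myFirstDatabase")
```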
Using the Database Components
Integrate.io provides both a source component and a destination component for MongoDB. This means you can load data from and save data to a MongoDB connection in your pipeline. These components, combined with other Integrate.io features like select transformations, filter transformations, and curl functions, provide a lot of flexibility in how you can shape the data.
The source and destination components support the major MongoDB data types. The following table shows how these data types map to the data types native to Integrate.io:
| MongoDB Data Type | Integrate.io Data Type |
| --- | --- |
| 32-bit Integer | Integer |
| 64-bit Integer | Long |
| Double | Double |
| Boolean | Boolean |
| String | String |
| Object | Map |
| Array | Bag |
The support for Object and Array data types is quite helpful. This data is stored as BSON (Binary JSON), which makes it possible to apply advanced configurations like indexing and query optimization.
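As a rough illustration of why that matters, the sketch below (PyMongo, with a hypothetical customers collection and placeholder connection string) indexes and queries a field inside an embedded array, which is only possible because the data is stored as real BSON sub-documents rather than serialized strings.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["shop"]

# Because the orders are genuine BSON sub-documents, fields inside the
# embedded array can be indexed and queried directly.
db.customers.create_index([("orders.id", ASCENDING)])
expensive = db.customers.find({"orders.total_price": {"$gt": 50}})
```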
Example Use Case
An online commerce business maintains information about its customers in a relational database table (we'll use a MySQL database for this example). The company wants to perform some analytics on the data to calculate various metrics for each of its customers. To do that, they'll first have to add customer order data, which is available to them through a RESTful API. Once the order data is combined with the customer information, it can be transformed according to the requirements and ingested into a MongoDB collection. This collection can then serve as the data source for any analytics tool or engine.
To get the order details for a customer, we need to query the /orders resource endpoint via the POST HTTP method and provide the customer's email in the request payload. The JSON response returned by the API for a particular customer looks something like this:
{
"orders": [
...
{
"id": 29393,
"total_price": 100,
"tax_lines": [
// These objects themselves are nested and so on.
]
...
},
{
...
},
...
]
}
Here's how the complete response looks.
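Outside of Integrate.io, fetching that payload would amount to a request along these lines. The URL and email are hypothetical placeholders; the sketch uses Python's requests library just to make the request/response contract explicit.

```python
import requests

# Hypothetical endpoint and customer email; the email goes in the POST body,
# and the orders come back in the JSON response shown above.
response = requests.post(
    "https://api.example.com/orders",
    json={"email": "jane@example.com"},
    timeout=10,
)
orders = response.json()["orders"]
```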
Implementing the Pipeline
Now that our requirements are defined, let's see how we'd implement such a pipeline in Integrate.io.
Adding the MongoDB Connection
To set up a MongoDB connection in the Integrate.io dashboard, make sure you have access to the following details from MongoDB:
- Hostname
- Username and password
- Database name
For this example, we will use the details of a free cluster set up on MongoDB Atlas. If you're using Atlas, you can find the hostname by clicking the Connect button in the Clusters tab of your dashboard.
Next, go to the Database Access tab for the user and password details. Create a new user, and take note of the username and password. Enter these values in the New Connection configuration in Integrate.io. Be sure to check the "Connect using SSL" option so that Integrate.io can connect to the Atlas cluster securely. By default, an Atlas cluster comes with a database named myFirstDatabase. If you want to use another database, you can create one via the Mongo shell. Finally, you can test the connection before saving to make sure everything works.
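If you'd like to verify the same details outside the dashboard, a quick PyMongo ping against the cluster is a rough equivalent of Integrate.io's connection test. The user, password, and hostname below are placeholders for the values you just collected.

```python
from pymongo import MongoClient

# Placeholder user, password, and hostname -- substitute the user created under
# Database Access and the host shown by the Connect button for your cluster.
client = MongoClient("mongodb+srv://etl_user:secret@cluster0.example.mongodb.net/myFirstDatabase")

# Rough equivalent of the dashboard's "test connection": ping the cluster.
client.admin.command("ping")
```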
Integrate.io also supports some advanced configurations like specifying Read Preference, Authentication Database, and connecting directly to replica set members of the cluster. You can learn more about them in the Allowing Integrate.io access to MongoDB docs.
Adding a Connection to the Relational Database
The connection for the relational source database needs to be configured separately from the MongoDB connection. For this example, we will use a locally hosted MySQL database (exposed through an ngrok tunnel).
Setting up the ETL Pipeline
Now that we have data connections in place, we can start working on the pipeline.
Create a new dataflow package from the dashboard and give it a descriptive name like "Customers and Orders to MongoDB."
The pipeline implementation has three components: a MySQL database source component, a Select transformation component, and a MongoDB destination component. Let's take a look at each of them in detail:
MySQL database source component: In the Input connection setting, we select the connection we created earlier. Then we provide the table name (customers in our case). After that, Integrate.io automatically pulls up the schema with a data preview. After selecting the fields we want to use, giving them an alias, and verifying their data types, we can save it.
Select transformation component: In the select component, we'll add an expression that uses Integrate.io's Curl method to fetch the order details for a customer and assign the response body to a field.
Since it might be inconvenient to construct the expression directly in the fields of the dialog box, we can open the expression editor by clicking the pencil icon on the left and build the expression there. The JSON body for the request is constructed with the TOMAP and ToJson functions.
Database destination component: Next, we'd want to ingest all of the data into a MongoDB collection. Here, the operation type is specified as "Merge with existing data using update and insert". You can check our docs to understand in detail what this operation does. The columns to write to the database can be populated using the Auto-fill option. We need to select the columns comprising the Primary Key or Composite Key that uniquely identifies each record. If we want to store any of the map or bag fields as BSON instead of a string, we can check the BSON checkbox.
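Conceptually, this operation behaves like a MongoDB upsert keyed on the selected columns. The sketch below is a PyMongo illustration of that behavior with a hypothetical email key and placeholder connection string, not what Integrate.io executes internally.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["shop"]

# The selected key column (a hypothetical "email" here) identifies the record:
# an existing document gets updated, and a new one is inserted otherwise.
db.customers.update_one(
    {"email": "jane@example.com"},
    {"$set": {"name": "Jane Doe", "orders": [{"id": 29393, "total_price": 100}]}},
    upsert=True,
)
```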
Here's how the pipeline looks upon completion:
Pipeline in Action
Once implemented, we can validate and then execute the package by creating a job. The job status can be monitored from the Integrate.io dashboard.
Once the job is finished, we can verify that the data was saved correctly by running queries through a MongoDB shell.
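Any client works for this check. For example, a quick spot-check with PyMongo (placeholder connection string and email; the mongosh equivalent would be db.customers.findOne(...)) might look like this:

```python
from pymongo import MongoClient

# Placeholder connection string and email.
client = MongoClient("mongodb+srv://etl_user:secret@cluster0.example.mongodb.net/myFirstDatabase")
db = client["myFirstDatabase"]

# Spot-check a single customer and confirm the embedded orders arrived intact.
doc = db.customers.find_one({"email": "jane@example.com"})
print(doc["orders"][0]["total_price"])
```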
Conclusion
In this post, we discussed using MongoDB to store data and went over the steps to add MongoDB to an Integrate.io pipeline. We connected to a managed MongoDB instance set up on Atlas over a secured protocol, combined customer data with additional data fetched from a REST API, and ingested the result into MongoDB using its native data types (arrays, objects, strings, etc.).
We touched only on the basic functionality in this post, but there are more advanced things you can do with MongoDB in Integrate.io. For more complex use cases, refer to our Knowledge Base. If you have not yet tried Integrate.io for yourself, schedule a call with our Support team to find out if it's a good fit for your use case.