Customers sometimes need to quickly write vast amounts of data to a 3rd party REST API. Calling an API single threaded can be very slow. Customers ideally want to call their API many times in parallel to speed data throughput.
Integrate.io uses Hadoop MapReduce to distribute computing tasks over a cluster of servers and is ideal for big data sets that are unable to be processed on a single server (due to memory or disk constraints). This enables the parallel processing of petabytes of data.
MapReduce breaks processing into four main phases:
-
Map the output are key-value pairs grouped by the key
-
Shuffle all the values with the same key are collected together by the Reducer
-
Sort all the records are sorted by the key by the Reducer
-
Reduce aggregation of the data values by the Reducer
Customers can perform Curl calls in parallel to speed up the writing of data to a 3rd party API. Adding a Sort component right before the Select component that does the Curl() ensures the task is distributed across the customer’s cluster nodes (as it is then processed by a Reducer and not a Mapper).
By default Integrate.io uses a single Hadoop Reducer (the Integrate.io variable _DEFAULT_PARALLELISM is set to 0), _DEFAULT_PARALLELISM can be set to 5 for a single node Hadoop cluster enabling it to process 5 parallel threads.
Increasing the Integrate.io variable_DEFAULT_PARALLELISM only helps if the customer’s Hadoop cluster has multiple nodes, eg 15 parallel threads, _DEFAULT_PARALLELISM, requires a 3 node cluster.
Regular Curl Vs Parallel Curl
It is important to check that your 3rd party API can handle the write load or the job will fail with rate limiting. Services like Intercom and Salesforce often need higher tier plans for the additional API calls per minute and data throughput.
Many of Integrate.io’s components are designed to run in parallel by default, including database connections and workflows. Please reach out to support for more information.