Google BigQuery makes data processing cool again—it's fast, secure, and surprisingly cheap. BigQuery performs the same function for Google as Redshift does for Amazon. (Deciding between the two options? Check out our all-in-one article "Redshift vs. BigQuery: A Comprehensive Guide.")
Because BigQuery is a columnar database with append-only tables, BigQuery data processing is lightning-fast: for example, queries can scan 35 billion rows in tens of seconds, according to Google. The BigQuery product is a consumer-facing version of Dremel, an internal database technology that Google built to handle massive services such as search, YouTube, Gmail and Google Docs.
That said, you might have a tiny question in the back of your mind: do we need to throw away the old standby Apache Hadoop in favor of BigQuery data processing?
Although big data processing in BigQuery sounds like the perfect end-all solution, it’s not right for everything. Yes, it’s good for real-time querying, but Google itself recommends using Hadoop’s MapReduce rather than BigQuery for three main cases:
- Executing complex data mining on big data that requires multiple iterations and paths of data processing with programmed algorithms.
- Executing large join operations across huge datasets.
- Exporting large amounts of data after processing.
So, if you're thinking of doing any of the above kinds of data processing in BigQuery, you’ll be better to offload these workloads to Hadoop. Better yet—you can offload them to Integrate.io.
Read more about MapReduce and Spark on our blog.
Integrate.io is powered by Apache Hadoop, with the added benefit of BigQuery integration. You can take data from Google Cloud Storage and many other sources, mine it, join it, and transform it with Integrate.io. Then you can store the results on Google BigQuery for your data warehouse needs.
Ready to learn how to offload BigQuery data processing tasks to Integrate.io? Let's get started.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
Google BigQuery Data Processing Settings
To start using Google BigQuery data processing, you'll first need to enable access to Integrate.io within BigQuery.
Project Access
In order to run load jobs, you need to grant read permissions to Integrate.io for the relevant BigQuery dataset project container. Follow the steps below:
- Access your Google Cloud Platform Console.
- Click on the project name for which you want to grant Integrate.io access.
- Under the "Project info" widget on the GCP dashboard, click on "Add people to this project".
- Paste Integrate.io's Google Cloud email into the "New members" text box: 869780223921-fkm5hu2ev8la67tqrntm6b5tjm6c6sgl@developer.gserviceaccount.com
- In the "Select a role" dropdown menu, click on "Project > Viewer".
- Click on "Save".
Dataset Access
In order to enable Integrate.io to write to a specific dataset, you need to grant write permissions to Integrate.io for the relevant BigQuery dataset. Follow the steps below:
- Access your Google Cloud Platform Console.
- Click on the project name for which you want to grant Integrate.io access.
- In the left sidebar menu, click on "BigQuery" under the Big Data header.
- Verify that the project name is correct in the new window. To switch projects, click on the down arrow next to the project name, then select a different project from the "Select a project" screen.
- Under the "Resources" heading in the left sidebar, select the dataset to which you want to enable access.
- Click on "Share dataset" near the right side of the window.
- Click on the "Dataset permissions" tab in the panel, then click on "Add members".
- In the "New members" text box, paste Integrate.io's Google Cloud email: 869780223921-fkm5hu2ev8la67tqrntm6b5tjm6c6sgl@developer.gserviceaccount.com
- Under "Select a role", enable editing access for BigQuery.
- Click "Done".
Getting the Project ID
Finally, you'll need to get the project ID to set up Google BigQuery data processing with Integrate.io.
- Go to the Google APIs Console.
- Click on the down arrow next to the current project name to bring up a list of all projects.
- Look for your project name and copy its project ID from the "ID" column.
For more information on these three steps, see Google's guide Google BigQuery Access Control.
Integrate.io Data Processing Settings for BigQuery
After adjusting the Google BigQuery data processing settings, you'll also need to set up the BigQuery connection in the Integrate.io settings.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
Configuring the Google BigQuery Connection
- Log in to your Integrate.io account, or sign up for free.
- Click on the user image at the top right and select "Manage connections".
- Click on the new connection button and select "Google BigQuery" under the Analytical Databases heading.
- Enter a name, the project ID that you copied from BigQuery, and the relevant dataset ID.
- Click on the "test connection" button to make sure that the connection works. If it fails, one of the details may have been entered incorrectly, or you may not have given Integrate.io permissions to connect to your BigQuery dataset.
- Click on the "create google bigquery connection" button.
Output Processed Data to Google BigQuery
- Open or create a new Integrate.io package under My Packages.
- Add the Google BigQuery destination component by clicking on the + (plus) button under the relevant component.
- Click on the component to edit it.
- Select the relevant Google BigQuery connection, and enter the target table. If you wold like, you can auto-fill the fields with the button on the right.
- Click on the "okay" button. Don’t forget to save the package before running it.
Data Processing in BigQuery: Is It Right for You?
As already discussed, Google BigQuery is great for real-time querying of huge datasets. For optimal performance, however, you should offload batch processing to Apache Hadoop—especially for tasks like complex data mining, large joins, and exporting large amounts of processed data.
Integrate.io, which is powered by Hadoop, can easily offload these workloads, and can write the results back to Google BigQuery. If you think that Integrate.io is the right tool to offload your data to BigQuery, we’re happy to provide a demo, a risk-free trial, and a free setup session with our implementation team!