The ever-increasing need for data storage and analytics has led to unprecedented growth in various data ingestion, storage, analytics, and warehousing technologies. In this article, we will focus on BigQuery and how to load data into this data warehouse.
BigQuery is an enterprise-grade data warehousing solution offered by Google via the Google Cloud Platform (GCP). This serverless, fully managed data warehouse is designed to handle petabytes of data quickly, efficiently, and cost-effectively.
Table of contents
- Advantages and Disadvantages of BigQuery
- How to Import Data into BigQuery
- Choosing a Data Loading Method
- Conclusion
Advantages and Disadvantages of BigQuery
BigQuery is fully managed and serverless, eliminating the need for users to worry about managing, updating, and optimizing the underlying infrastructure. Google handles all those aspects while ensuring the performance, availability, and scalability of data.
BigQuery uses the standard SQL dialect for DDL and DML, so there is no new query language to learn. Moreover, it offers client libraries for multiple languages such as Java, Python, Go, and C#, allowing developers to build against and interact with BigQuery in their preferred programming language.
Furthermore, BigQuery offers built-in support for geospatial data types and functions (BigQuery GIS), natural language queries (Data QnA), ML and AI integrations (BigQuery ML), and visualization tools (Google Data Studio). This allows users to transform and analyze data stored in BigQuery without going through complex configuration or normalization processes.
Automated backup and restore facilities in BigQuery enable users to restore and compare historical data easily. Additionally, BigQuery provides geographic data separation by storing data in different physical data centers, which allows organizations to meet data storage compliance requirements more conveniently. All these features are made available through the security and governance options built into the GCP platform.
As for the disadvantages, users need extensive knowledge of SQL to get started with BigQuery, since it relies on properly optimized SQL queries; unoptimized queries consume more resources and scan more data, leading to higher operational costs. Besides, the BigQuery UI can be a bit clunky, and there is a steep learning curve.
How to Import Data into BigQuery
Now that we know what BigQuery is, let's move on to how to load data into it. BigQuery provides multiple ways to load data; the following are some of the available methods.
MANUAL DATA INSERTS
The most basic way to load data is to insert them manually. We can use DML statements to add data to a table as single or multiple INSERT statements. This can be done either via the BigQuery Web UI or programmatically via a client library.
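As a quick illustration of the programmatic route, the following minimal sketch runs a single INSERT statement through the Python client library. It assumes the sample dataset and table created later in this article already exist and that application default credentials are configured:
from google.cloud import bigquery

# Minimal sketch: run an INSERT DML statement through the client library
client = bigquery.Client()  # uses application default credentials
query = """
    INSERT `test_dataset.user_details` (user_id, first_name, last_name, age, address)
    VALUES (001546, 'Barry', 'Stevens', 21, '82, Kings Road')
"""
client.query(query).result()  # wait for the DML job to finish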
BATCH DATA LOADING
The batch loading method can be used when a collection of data needs to be loaded as a single operation. ETL (Extract, Transform, and Load) data ingestion falls under batch data loading. Batch jobs can be run as one-time operations or on a recurring schedule.
BigQuery supports data loading from cloud storage or a local file using a load job. The supported record formats are Avro, CSV, JSON, ORC, and Parquet. Moreover, Google offers the BigQuery Data Transfer Service to batch load data from Google SaaS apps and third-party applications. Besides, you can use data exports such as Firestore database exports to import data through batch loading.
STREAMING DATA
Instead of using load jobs, users can stream data directly into a BigQuery table as individual records or batches. This method is well suited to real-time or continuous data flows. We can use the streaming API or Google Dataflow with Apache Beam to set up streaming.
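As a rough sketch of the Dataflow route, the following Apache Beam (Python SDK) pipeline reads JSON messages from a Pub/Sub topic and streams them into a BigQuery table. The project, topic, and table names here are placeholders, not part of this article's setup:
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming pipeline: Pub/Sub -> parse JSON -> stream into BigQuery
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/user-events")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:test_dataset.user_details",
            schema="user_id:INTEGER,first_name:STRING,last_name:STRING,age:INTEGER,address:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )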
THIRD-PARTY SOLUTIONS
Some third-party applications such as Skyvia and Informatica support loading data into BigQuery. Yet, the ease of use and configuration depends on the type of application and user requirements.
Choosing a Data Loading Method
Depending on the requirements, users need to select a proper data loading method or use multiple methods to load data into BigQuery throughout the application lifecycle. For instance, they can use a batch load job to load existing data into BigQuery while using streaming inserts to ingest data at runtime or programmatically using DML statements. Load latency, data change frequency, and reliability are some aspects that need to be considered when selecting a data loading method. You can find more information related to this in the BigQuery documentation.
Choosing File Format for Loading Data
Choosing an appropriate file format is one of the crucial factors to consider when loading data. Without a proper file format, users will run into issues while ingesting data, or the loading process may fail completely. Thus, we need to adhere to the following guidelines provided by Google to mitigate these issues.
SUPPORT FOR SCHEMA IDENTIFICATION
On the one hand, BigQuery can automatically detect the table schema for self-describing formats such as Avro, Parquet, ORC, and Firestore exports. On the other hand, for formats like JSON or CSV, users need to provide a schema explicitly or use the schema auto-detection feature.
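For example, auto-detection can be enabled on a CSV load job with the Python client library. This is only a sketch; the Cloud Storage path and destination table below are hypothetical:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer column names and types
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/user_details.csv",  # hypothetical Cloud Storage path
    "test_dataset.user_details_autodetect",
    job_config=job_config,
)
load_job.result()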
FLAT DATA VS. NESTED AND REPEATED FIELDS
While most file formats support flat data, formats like Avro, JSON, ORC, Parquet, and Firestore exports also support nested and repeated fields. Therefore, if the user needs to store hierarchical data, the only option is to use a format that supports nested and repeated fields. This also helps to reduce data duplication.
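To illustrate, a nested and repeated field can be declared with the Python client library as follows. The "phone_numbers" field is a hypothetical example and is not part of this article's sample data set:
from google.cloud import bigquery

# A REPEATED RECORD field holds an array of structs inside a single row
schema = [
    bigquery.SchemaField("user_id", "INTEGER"),
    bigquery.SchemaField(
        "phone_numbers", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("type", "STRING"),
            bigquery.SchemaField("number", "STRING"),
        ],
    ),
]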
EMBEDDED NEWLINE
BigQuery requires JSON files to be newline-delimited. This translates to newline-delimited JSON files containing a single record per line.
ENCODING
The primary encoding format of BigQuery is UTF-8, with support for both flat and nested or repeated data. Additionally, BigQuery supports ISO-8859-1 encoding for CSV files containing flat data.
Choosing a Sample Data Set
In this section, we will see some practical examples on how to load data into BigQuery. We will use the below data set as the sample data for all the data loading operations.
| user_id | first_name | last_name | age | address |
|---------|------------|-----------|-----|---------|
| 001546 | Barry | Stevens | 21 | 82, Kings Road |
| 002546 | Harry | Potter | 30 | 785/88B, Lakewood |
| 020054 | Sam | Smith | 25 | 010, Mud Lane |
| 015778 | Kevin | Lee | 31 | 875, Melrose Avenue |
| 002336 | Matt | Harris | 27 | 99/110, Lake Drive |
Loading Data Using BigQuery Web UI (Cloud Console)
Before getting started, we need to create a data set. A data set in BigQuery is a top-level object that is used to organize and control access to the tables and views.
Step 1. Navigate to the web UI and click on the Create data set option on the project.
Step 2. On the data set creation page, provide a name and a data location. The data location corresponds to the physical location where the data is stored.
Step 3. Now, we can create a table within the data set to insert data. In the Create table option, the user will be presented with multiple data sources that can be used to create tables.
CREATING AN EMPTY TABLE AND INSERTING DATA VIA DML
The empty table option allows users to define a schema and create an empty BigQuery table.
Once the table has been created, we can load data via DML, specifically using the following INSERT query executed in the BigQuery SQL workspace.
INSERT
  `test-applications-315905.test_dataset.user_details`
  (user_id, first_name, last_name, age, address)
VALUES
  (001546, 'Barry', 'Stevens', 21, '82, Kings Road'),
  (002546, 'Harry', 'Potter', 30, '785/88B, Lakewood'),
  (020054, 'Sam', 'Smith', 25, '010, Mud Lane'),
  (015778, 'Kevin', 'Lee', 31, '875, Melrose Avenue'),
  (002336, 'Matt', 'Harris', 27, '99/110, Lake Drive');
RESULT
CREATING A TABLE USING A CSV FILE
Among the available data source options, we can use a local CSV file as the source for a batch load job by selecting the Upload option and pointing to the file. Since this is a CSV file, we need to define the schema for the table. In the following example, we use BigQuery's auto-detection feature to identify the schema and input parameters automatically.
This feature will automatically create a new table called “user_details_csv” and load all the data in the CSV file.
The Web UI is the most user-friendly and convenient option for anyone new to BigQuery. With all the options presented in an interactive UI, users can easily create and manage BigQuery resources and run SQL queries. However, one downside of the Web UI is that it can be slow and time-consuming compared to the other options. Besides, some advanced options are easier to configure via a client library or the CLI than through the Web UI. Please refer to this YouTube video by Google Cloud Tech for the complete process of loading data and utilizing Cloud Storage.
Loading Data Using the CLI
The bq command-line tool is a Python-based CLI for interacting with BigQuery. It is an efficient way to work with larger BigQuery datasets. Users can compose bq commands and store them in scripts to simplify the management of BigQuery resources; this scripted approach is especially helpful for troubleshooting and monitoring data sets. One downside of the CLI is that users need in-depth knowledge of the BigQuery platform and the underlying data structure to use it effectively.
Now let's see how we can use the bq CLI to load a JSON data set into BigQuery. We will explicitly define the table schema in the command, as shown below. Make sure that the JSON file is in newline-delimited format, with one record per line.
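Assuming the sample data set above, the user_details.json file would contain one JSON object per line, along these lines (the user_id values are plain integers here, since JSON numbers cannot carry leading zeros):
{"user_id": 1546, "first_name": "Barry", "last_name": "Stevens", "age": 21, "address": "82, Kings Road"}
{"user_id": 2546, "first_name": "Harry", "last_name": "Potter", "age": 30, "address": "785/88B, Lakewood"}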
bq load --source_format=NEWLINE_DELIMITED_JSON \
test_dataset.user_details_json user_details.json \
user_id:integer,first_name:string,last_name:string,age:integer,address:string
RESULT
Then, we can inspect the “user_details_json” table to verify that the load job was successful.
bq show test-applications-315905:test_dataset.user_details_json
RESULT
Loading Data Using REST API
Another method to load data into BigQuery is via the REST API. The best way to utilize the API is through the Google-provided client libraries for different programming languages. These client libraries communicate directly with the REST API and give developers a streamlined development experience. Moreover, they handle all the low-level communication, including authentication, eliminating the need for developers to craft API calls from scratch. This is the ideal option when integrating BigQuery with your software applications or scripts.
However, a user needs to know the specific programming language as well as the functionality of the client library to use these libraries, which leads to a steeper learning curve. The following code is an example of loading data through the Python client library using a CSV file stored in a Google Cloud Storage bucket.
from google.oauth2 import service_account
from google.cloud import bigquery

# Create authentication credentials from a service account key file
project_id = "test-applications-xxxxx"
table_id = f"{project_id}.test_dataset.user_details_python_csv"
gcp_credentials = service_account.Credentials.from_service_account_file('test-applications-xxxxx-74dxxxxx.json')

# Create the BigQuery client
bq_client = bigquery.Client(credentials=gcp_credentials, project=project_id)

# Define the table schema and CSV options for the load job
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("user_id", "INTEGER"),
        bigquery.SchemaField("first_name", "STRING"),
        bigquery.SchemaField("last_name", "STRING"),
        bigquery.SchemaField("age", "INTEGER"),
        bigquery.SchemaField("address", "STRING"),
    ],
    skip_leading_rows=1,  # skip the CSV header row
    source_format=bigquery.SourceFormat.CSV,
)

# CSV file location (Cloud Storage bucket) - load jobs require a gs:// URI
uri = "gs://test_python_functions/user_details.csv"

# Create the load job and wait for it to complete
csv_load_job = bq_client.load_table_from_uri(
    uri, table_id, job_config=job_config
)
csv_load_job.result()
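Optionally, we can confirm that the load succeeded by checking the row count of the destination table; this small check is an addition to the example above:
# Optional check: fetch the destination table and print how many rows it holds
table = bq_client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")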
If we look at the data set now, we can see that a new table has been created within the dataset.
Loading Data Using Third-Party Solution
Skyvia is one of the third-party solutions that support data loading into BigQuery. It also supports data loading to other data warehouses, databases, and cloud apps. To use Skyvia, you do not need any locally installed software; a web browser and a registered account are enough.
No coding skills or manual effort are required. Skyvia allows you to automate the process so data loads run on a schedule without human intervention.
There are three steps to a successful data import to BigQuery with Skyvia:
STEP 1: CREATE CONNECTION TO BIGQUERY IN SKYVIA
To create a connection, select Google BigQuery from the list of data sources supported by Skyvia. In the BigQuery connection window that opens, sign in via OAuth 2.0 and specify the parameters required to connect to BigQuery, which include Project Id, DataSet Id, and Cloud Storage Bucket.
If you load data into BigQuery from a cloud app or database, you also need to create a connection to that source.
STEP 2: CREATE IMPORT PACKAGE TO LOAD DATA
To create an import package, select the Import package type to open the package editor and configure the package settings.
First, you need to select a source from which to load data. It can be a CSV file from file storages like Amazon S3 or Azure, or you can load data directly from databases like SQL Server or cloud apps like BigCommerce, HubSpot, QuickBooks, etc. Second, you need to select the target (Google BigQuery) to load data to. Third, you need to add a task to the package: specify the path to your CSV file, set the CSV options (for example, the CSV separator you want to use), select a target object in BigQuery to load data to, select the operation type (for example, INSERT), and configure the mapping between the CSV file from Amazon S3 and Google BigQuery for better matching.
When one task is ready, you repeat the same steps if you want to add another task with another CSV file.
STEP 3: SCHEDULE YOUR PACKAGE FOR AUTOMATIC EXECUTION
This feature of Skyvia might be useful if you want to configure data loading operations to run periodically or if you want to delay an operation to a later time.
Based on your needs, you can schedule the created package to execute either once (one-time execution) or at recurring intervals (repeated execution). You can run your packages on a daily, weekly, or monthly basis: once a day at a specific time or multiple times at set intervals. You can also select the specific weekdays on which you want your package to be executed.
Once the package is scheduled, the automated data loading process starts. If any error occurs, you will be notified about it. Everything is transparent, simple, and fast.
Streaming Data Ingestion
Streaming data ingestion is suited for continuous, real-time data processing, inserting one record (or a small batch) at a time. This is done using the “tabledata.insertAll” API method. Streamed data is first written to the streaming buffer and then written to the actual table in columnar format.
While the data is in the streaming buffer, users cannot update or delete the records. Due to that, streaming inserts are more suitable for data streams that do not require immediate alterations to the data. Below you can find an example of streaming data to the “user_details” table.
from google.cloud import bigquery
from google.oauth2 import service_account

# Create authentication credentials from a service account key file
project_id = "test-applications-xxxxx"
table_id = f"{project_id}.test_dataset.user_details"
gcp_credentials = service_account.Credentials.from_service_account_file('test-applications-xxxxx-74dxxxxx.json')

# Create the BigQuery client
bq_client = bigquery.Client(credentials=gcp_credentials, project=project_id)

# Data to insert (one record per dictionary)
rows_to_insert = [
    {"user_id": 254475, "first_name": "Alice", "last_name": "Marie", "age": 32, "address": "45, Lex Drive"}
]

# API call - returns a list of per-row errors, empty if all rows were streamed
errors = bq_client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Errors while streaming rows: {errors}")
The above code will add a new record to the “user_details” table. However, if we try to delete the row immediately, it will throw an error since data is still in the streaming buffer.
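As an illustration of that limitation, attempting a DELETE right after the streaming insert, for example with the client created above, is rejected while the row is still in the streaming buffer. This is only a sketch reusing the variables from the previous example:
# Hypothetical illustration: this DELETE fails while the streamed row
# is still in the streaming buffer
delete_query = f"""
    DELETE FROM `{table_id}`
    WHERE user_id = 254475
"""
bq_client.query(delete_query).result()  # raises an error until the buffer is flushed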
Data Transfer Service
This fully managed data transfer service provides a streamlined method to ingest data from Google SaaS applications such as Campaign Manager, Google Ad Manager, and YouTube reports, external storage providers like Amazon S3, and data warehouses like Amazon Redshift or Teradata. This is the best option for handling large data sets while automating data ingestion and migration.
Conclusion
Google provides users with multiple ways to interact with BigQuery. In this article, we discussed the tools and methods available for users to load data into BigQuery easily. In a real-world scenario, users should utilize the best available solution that meets their exact needs to import data into BigQuery or use a combination of these methods to optimize the overall data pipeline.
Third-party solutions like Skyvia are also a good option if you need to automate the data loading process.