What Is Batch ETL Processing? The Only Guide You Need

Data has become a staple to many modern businesses. We have become more data-driven in critical decisions. So, we collected more data. In fact, global data creation is projected to grow to more than 180 zettabytes by 2025. That’s equivalent to around 35,667,673,068,966 DVDs. And that’s a lot! So, there’s an ever-growing trend to process them faster. You can do it with ELT (Extract, Load, Transform). Or use ETL. This article will focus on Batch ETL. Then, we will compare it with its close cousin, streaming ETL.

ETL batch processing is a more traditional but solid approach. Meanwhile, stream processing is the newer kid on the block. So, which of the two is the one you need? Find out in this in-depth comparison.

Table of contents

  1. What is ETL (Extract, Transform, Load)?
  2. Batch ETL Processing: How it Works
  3. Streaming ETL Processing
  4. Batch Processing vs. Streaming Processing
  5. Batch and Streaming Processing Pros and Cons
  6. Examples of ETL Batch Processing Using Skyvia
  7. Conclusion

Let’s dive in.

What is ETL (Extract, Transform, Load)?

ETL is a data pipeline that involves a three-step process: extract, transform, and load.

The following shows how it works:

Typical ETL process

This 3-step process goes like this:

First, extract the data from various sources. It can be flat files like CSV or Excel. Or from relational databases like Oracle or SQL Server. And it can also be third-party sources like SalesForce or HubSpot.

Then, transform the data by cleaning, filtering, summarizing, and more.

Finally, load. So, this type of data processing has reached maturity. And it’s typical to use ETL batch processing in the earlier days. Here are some specific examples of ETL batch processes today:

ETL is a since the 70s that involves a three-step process: extract, transform, and load.

  • Processing of monthly sales data into a data warehouse.
  • Gathering daily fingerprint scans with date and time from a biometrics machine. This is useful for employee time and attendance.
  • Combining client profiles for two merging companies.

But ETL can also be real-time or streaming ETL. This is perfect for data that needs to be processed immediately.

The following section describes further how a batch ETL works.

Batch ETL Processing: How it Works

Batch ETL definition is simple. It refers to data processing in batches that covers a time period.

The following describes a batch ETL processing:

  • Volumes of data are extracted and processed from the source. Extraction is possible via a query language like SQL or a bulk or batch API for SaaS like Salesforce.
  • Extraction covers a batch window or a time period.
  • Processing is triggered manually or scheduled.

Let’s have some examples.

Batch ETL Processing: How it Works

Above are 2 ETL batch processing examples that happen through a stream of time.

The first one goes like this:

  • The administrator ran the batch process manually.
  • He extracted server logs like CPU utilization, disk space, RAM usage, and more.
  • Coverage is within a batch window – logs from last night. Could be from 7:00 PM to 11:59 PM.

Meanwhile, at the start of the next day, the ETL scheduler triggered another batch process. And here’s how it worked:

  • The process was automated to run every first of the month.
  • Extracted data covers posted accounting entries from July 1 to July 31 of the current year.

So, data collection occurs first in batch ETL. Then, the extraction takes place at a set time or as needed. You can perform these sorts of ETL batch processes using modern ETL tools.

But batch ETL can be challenging as you will see later. First, let’s describe what streaming ETL is.

Streaming ETL Processing

So, what is streaming ETL?

Streaming ETL is a data pipeline that starts as soon as a record is available in the source. It’s sometimes called real-time ETL or stream processing. Though it’s not literally real-time. Because it takes a fraction of a second to extract, transform, and load the whole thing.

Real-world examples of streaming ETL include:

  • Credit card fraud detection.When you swipe your credit card, a transaction is sent to a fraud detection system for analysis. Then, a deny or approve status is sent back depending on several factors.
  • Location tracking.GPS devices send real-time location information to monitor truck movements on a map.
  • Real-time patient monitoring.Medical equipment and devices monitor patient vital signs to quickly respond to emergencies.

Streaming ETL architecture may also involve a stream processing platform. An example of this streaming platform is Apache Kafka. Your source systems can publish data to the stream processing platform. Then, you can ingest the stream and process it. Finally, you can store it to the target.

Streaming ETL tools also exist to serve this purpose.

Batch Processing vs. Streaming Processing

Batch ETL vs. streaming ETL — what’s the difference?

The main differences between the two are the following:


Batch ETL

Streaming ETL
CoverageRows of data collected hours, days, or weeks beforeA single new record
Time of executionTriggered manually or in a scheduleWhen a new record is added
How fastDone within seconds, minutes, or hoursDone within milliseconds or seconds
PurposeIn-depth, periodic reporting and analysisFor real-time interaction with people or devices. Also, for near real-time analysis.

Companies that understand the differences use both approaches. So, converting all your batch ETL to streaming ETL may not be a wise choice.

Both approaches have their pros and cons. Check out the following section.

Batch and Streaming Processing Pros and Cons

Check out the table below before you convert your batch ETL to streaming ETL.


Pros
Cons
Batch ETLPerfect for volumes of data that needs posting, approval, and the like before processing.One row with invalid entries will halt the whole batch process.
Higher data quality compared to streaming ETL.As the volume of data increases, the processing time also increases. Worst, it may overlap with the next scheduled runtime.
Several solid and mature tools exist. This includes tools with batch API to handle a limited number of API calls to third-party apps.
Easier to design, test, and maintain.
Hiring new talents is easier because many data professionals know it.
Streaming ETLPerfect for just-in-time data from devices or apps.Can be complex to design, test, and maintain.
If an error occurs, only one record is affected.Some streaming ETL platforms require coding.
Depending on the tool and setup, streaming ETL can be highly-scalable and fault-tolerant.Fewer people know about it. Hiring is a challenge.
Outages during operations are more urgent compared to batch ETL.
Possible real-time processing of dirty data (duplicates, incomplete, etc.) if not handled in pipeline design.
Costly if the limited number of API calls to SaaS is exceeded.

Examples of ETL Batch Processing Using Skyvia

It’s dead easy to do ETL batch processing in Skyvia. ETL options you have in Skyvia are the following:

  • Data Replication. For creating copies of data from the cloud to relational databases or vice versa. It requires a source and a target.
  • Data Import. For basic ETL between 2 data sources in any direction. It also covers reverse ETL where a data warehouse is the source and a cloud app is the target.
  • Control and Data Flow. For complex ETL data pipelines, including any number of data sources, done reverse or not.

Our examples have a goal to extract data from Salesforce and load it to Azure SQL Database. And we’ll show you 2 ways to do it.

But first, we need to create 2 connections. These connections are needed by the 2 batch ETL examples.

Creating the Salesforce Connection

Here are the steps for the Salesforce connection:

1. ADD A NEW SALESFORCE CONNECTION

In Skyvia, click NEW. Then, click Connection.

Creating the Salesforce Connection for batch ETL processing in Skyvia 1

2. CHOOSE SALESFORCE

The following shows how to do this.

Creating the Salesforce Connection for batch ETL processing in Skyvia 2

3. CONFIGURE THE SALESFORCE CONNECTION

You need a valid Salesforce account or this step won’t work. So, check the configuration below.

Creating the Salesforce Connection for batch ETL processing in Skyvia 3

You’ll get a security key (OAuth token) after signing in. Then, click Test Connection.

Do not uncheck the Use Bulk API checkbox in Advanced Settings. This will use Batch API calls. And that will let you avoid reaching your API request limits and allocations.

Then, you’re ready to use this connection in a Skyvia package.

Creating the Azure SQL Database Connection

You need a valid Azure account to make this connection work. This also assumes that you have set up a fully-managed PaaS database and a table where the Salesforce records will be loaded.

You also need to whitelist Skyvia’s IP addresses in Azure. One way to do it is by enabling public network access with firewall rules. See below.

Creating the Azure SQL Database Connection for batch ETL processing in Skyvia 1

You will know the Skyvia IP addresses once you register in Skyvia.

So, here are the steps:

1. CREATE A NEW AZURE SQL SERVER CONNECTION

Click NEW. Then, click Connection.

2. CHOOSE SQL SERVER

Use the SQL Server connector as shown below.

Creating the SQL Server Connection for batch ETL processing in Skyvia 1

3. CONFIGURE THE AZURE SQL SERVER CONNECTION

You need to get the database credentials in Azure first. So, go to your Azure SQL database and click Connection Strings. You’re going to need the server’s name, User ID, and database name. See below.

Creating the Azure SQL Server Connection for batch ETL processing in Skyvia 1

Then, go to Skyvia and configure the SQL Server connection. Follow the instructions below using the Connection string information.

Creating the Azure SQL Server Connection for batch ETL processing in Skyvia 2

Test Connection will succeed if you use the correct credentials.

Batch ETL Example 1: Replicating Salesforce to Azure SQL Database

It’s time to build our easy example – a Skyvia Replication Integration. The following shows Salesforce to Azure SQL Server database replication.

Replicating Salesforce to Azure SQL Database in Skyvia 1

Then, test it by clicking Run in the top-right corner of the page. Use external tools like SQL Server Management Studio to check the replicated data.

Then, you can schedule this package to run at a set time. It can be a one-time or recurring schedule. It’s all up to you. See below.

Replicating Salesforce to Azure SQL Database in Skyvia 2

Batch ETL Example 2: Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database

Our next example is a more flexible ETL solution using a Control Flow and Data Flow.

We are going to use a simple example so this solution is easier to understand. And then you can use this knowledge to do complex Control Flows.

To do this, we will extract the Contacts from Salesforce. Then, add a new column as a transformation. And finally, store the results in Azure SQL.

We will use the dummy Contacts data created when you register to Salesforce. Check it out below.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database

So, we expect 7 names with initials present in Azure after we run the package.

Anyway, here are the steps:

1. CREATE A CONTROL FLOW

Click NEW. Then, click Control Flow.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 1

2. NAME THE CONTROL FLOW AND DRAG 2 ACTIONS

The initial configuration is shown below with a name and 2 actions.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 2

3. CONFIGURE THE ACTION COMPONENT

For simplicity, we will use a full load ETL to avoid duplicates. You can also do an incremental load in Skyvia but we won’t explain it here. So, the first action will remove all the records in the target table. Here’s how:

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 3

4. CONFIGURE THE DATA FLOW

Click the Data Flow component. And click Open Data Flow. Then, a new page will appear with a blank Data Flow. Then, do the following:

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 4

Then, configure the Source as shown below.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 5

Then, configure the Extend transformation.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 6

Then, you need to specify the expression for the new column.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 7

Finally, configure the Azure Target.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 8

Then, click Run.

The result of the batch ETL is shown below. Compare it to the Salesforce screenshot earlier.

Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database 9

This simple Control Flow can be expanded to have more Data Flows handling other Salesforce tables. You can make a parallel execution or run it in succession.

And you can also make a schedule to run based on your needs. The configuration is the same as the Replication package earlier.

Conclusion

Batch ETL processing is easy if you have the right tools.

If you plan to do a batch ETL project, why not try it with Skyvia? It’s easy and flexible for most ETL needs.

Edwin Sanchez
Edwin Sanchez
SQL Server & Power BI Developer
10 tip to overcoming Salesforce integration challenges