Data has become a staple of many modern businesses. We have become more data-driven in critical decisions, so we collect more data. In fact, global data creation is projected to grow to more than 180 zettabytes by 2025. That's equivalent to around 35,667,673,068,966 DVDs. And that's a lot! So, there's an ever-growing need to process data faster. You can do it with ELT (Extract, Load, Transform) or with ETL. This article focuses on batch ETL and then compares it with its close cousin, streaming ETL.
ETL batch processing is a more traditional but solid approach. Meanwhile, stream processing is the newer kid on the block. So, which of the two is the one you need? Find out in this in-depth comparison.
Table of contents
- What is ETL (Extract, Transform, Load)?
- Batch ETL Processing: How it Works
- Streaming ETL Processing
- Batch Processing vs. Streaming Processing
- Batch and Streaming Processing Pros and Cons
- Examples of ETL Batch Processing Using Skyvia
Let’s dive in.
What is ETL (Extract, Transform, Load)?
ETL is a data pipeline that involves a three-step process: extract, transform, and load.
Here's how this three-step process works:
First, extract the data from various sources. These can be flat files like CSV or Excel, relational databases like Oracle or SQL Server, or third-party sources like Salesforce or HubSpot.
Then, transform the data by cleaning, filtering, summarizing, and more.
Finally, load the transformed data into the target, such as a data warehouse.
ETL has been around since the 70s, so this type of data processing has reached maturity, and batch ETL was the typical approach in those earlier days. Here are some specific examples of ETL batch processes today:
- Processing of monthly sales data into a data warehouse.
- Gathering daily fingerprint scans with date and time from a biometrics machine. This is useful for employee time and attendance.
- Combining client profiles for two merging companies.
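The three steps can be sketched in a few lines of Python. This is a hypothetical, minimal example: an in-memory CSV stands in for a real source file, and a plain dictionary stands in for a data warehouse table.

```python
import csv
import io

# Extract: read rows from a CSV source (an in-memory file stands in for a real one).
source = io.StringIO("name,amount\nAlice,120\nBob,95\n")
rows = list(csv.DictReader(source))

# Transform: clean the data -- cast amounts to numbers -- and summarize.
for row in rows:
    row["amount"] = float(row["amount"])
monthly_total = sum(row["amount"] for row in rows)

# Load: write the result into the target (a dict stands in for a warehouse table).
warehouse = {"monthly_sales_total": monthly_total}
print(warehouse)  # {'monthly_sales_total': 215.0}
```

In a real pipeline, the extract step would read from files, databases, or APIs, and the load step would write to a warehouse, but the three-phase shape stays the same.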
But ETL can also be real-time or streaming ETL. This is perfect for data that needs to be processed immediately.
The following section describes in more detail how batch ETL works.
Batch ETL Processing: How it Works
The batch ETL definition is simple: it refers to processing data in batches, where each batch covers a time period.
The following describes a batch ETL processing:
- Volumes of data are extracted and processed from the source. Extraction is possible via a query language like SQL or a bulk or batch API for SaaS like Salesforce.
- Extraction covers a batch window or a time period.
- Processing is triggered manually or scheduled.
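The batch-window idea above can be sketched in Python. This is a hypothetical example with made-up server log rows; in practice, the extract step would be a SQL query or a bulk API call filtered on the same kind of time window.

```python
from datetime import datetime

# Hypothetical source rows, each stamped with the time it was collected.
logs = [
    {"ts": datetime(2023, 7, 31, 18, 30), "cpu": 42},
    {"ts": datetime(2023, 7, 31, 20, 15), "cpu": 91},
    {"ts": datetime(2023, 7, 31, 23, 50), "cpu": 77},
]

# Extract only the rows inside the batch window (7:00 PM to 11:59 PM).
window_start = datetime(2023, 7, 31, 19, 0)
window_end = datetime(2023, 7, 31, 23, 59)
batch = [r for r in logs if window_start <= r["ts"] <= window_end]

# Transform: flag high CPU readings. Load: build the rows for the target table.
target = [{"ts": r["ts"], "cpu": r["cpu"], "high": r["cpu"] > 80} for r in batch]
```

Only the rows inside the window make it into the batch; everything earlier waits for (or belongs to) another run.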
Let’s have some examples.
Above are two ETL batch processing examples that happen over a span of time.
The first one goes like this:
- The administrator ran the batch process manually.
- He extracted server logs like CPU utilization, disk space, RAM usage, and more.
- Coverage is within a batch window: last night's logs, say from 7:00 PM to 11:59 PM.
Meanwhile, at the start of the next day, the ETL scheduler triggered another batch process. And here’s how it worked:
- The process was automated to run every first of the month.
- Extracted data covers posted accounting entries from July 1 to July 31 of the current year.
So, data collection occurs first in batch ETL. Then, the extraction takes place at a set time or as needed. You can perform these sorts of ETL batch processes using modern ETL tools.
But batch ETL can be challenging as you will see later. First, let’s describe what streaming ETL is.
Streaming ETL Processing
So, what is streaming ETL?
Streaming ETL is a data pipeline that starts as soon as a record is available in the source. It's sometimes called real-time ETL or stream processing, though it's not literally real-time: it still takes a fraction of a second to extract, transform, and load each record.
Real-world examples of streaming ETL include:
- Credit card fraud detection. When you swipe your credit card, a transaction is sent to a fraud detection system for analysis. Then, a deny or approve status is sent back depending on several factors.
- Location tracking. GPS devices send real-time location information to monitor truck movements on a map.
- Real-time patient monitoring. Medical equipment and devices monitor patient vital signs to quickly respond to emergencies.
Streaming ETL architecture may also involve a stream processing platform. An example of this streaming platform is Apache Kafka. Your source systems can publish data to the stream processing platform. Then, you can ingest the stream and process it. Finally, you can store it to the target.
Streaming ETL tools also exist to serve this purpose.
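The record-at-a-time flow described above can be sketched in Python. This is a minimal, hypothetical example: a generator stands in for a stream platform such as Kafka, a simple threshold rule stands in for a fraud model, and a list stands in for the target store.

```python
# A generator stands in for the stream of incoming records.
def stream():
    yield {"card": "1234", "amount": 25.0}
    yield {"card": "1234", "amount": 9800.0}

target = []

# Each record is transformed and loaded the moment it arrives,
# rather than waiting for a batch window to close.
for event in stream():
    event["suspicious"] = event["amount"] > 5000  # transform: a toy fraud rule
    target.append(event)                          # load
```

The key contrast with batch ETL is the loop body: the pipeline runs once per record as it appears, not once per accumulated time window.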
Batch Processing vs. Streaming Processing
Batch ETL vs. streaming ETL — what’s the difference?
The main differences between the two are the following:
| | Batch ETL | Streaming ETL |
|---|---|---|
| Coverage | Rows of data collected hours, days, or weeks before | A single new record |
| Time of execution | Triggered manually or on a schedule | When a new record is added |
| How fast | Done within seconds, minutes, or hours | Done within milliseconds or seconds |
| Purpose | In-depth, periodic reporting and analysis | Real-time interaction with people or devices; also, near real-time analysis |
Companies that understand the differences use both approaches. So, converting all your batch ETL to streaming ETL may not be a wise choice.
Both approaches have their pros and cons. Check out the following section.
Batch and Streaming Processing Pros and Cons
Check out the table below before you convert your batch ETL to streaming ETL.
Batch ETL

Pros:
- Perfect for volumes of data that need posting, approval, and the like before processing.
- Higher data quality compared to streaming ETL.
- Several solid and mature tools exist, including tools with a batch API to handle a limited number of API calls to third-party apps.
- Easier to design, test, and maintain.
- Hiring new talent is easier because many data professionals know it.

Cons:
- One row with invalid entries can halt the whole batch process.
- As the volume of data increases, the processing time also increases. Worse, it may overlap with the next scheduled runtime.

Streaming ETL

Pros:
- Perfect for just-in-time data from devices or apps.
- If an error occurs, only one record is affected.
- Depending on the tool and setup, streaming ETL can be highly scalable and fault-tolerant.

Cons:
- Can be complex to design, test, and maintain.
- Some streaming ETL platforms require coding.
- Fewer people know it, so hiring is a challenge.
- Outages during operations are more urgent compared to batch ETL.
- Possible real-time processing of dirty data (duplicates, incomplete records, etc.) if not handled in pipeline design.
- Costly if the limited number of API calls to SaaS apps is exceeded.
Examples of ETL Batch Processing Using Skyvia
It's dead easy to do ETL batch processing in Skyvia. The ETL options you have in Skyvia are the following:
- Data Replication. For creating copies of data from the cloud to relational databases or vice versa. It requires a source and a target.
- Data Import. For basic ETL between 2 data sources in any direction. It also covers reverse ETL where a data warehouse is the source and a cloud app is the target.
- Control and Data Flow. For complex ETL data pipelines involving any number of data sources, reverse ETL included.
But first, we need to create 2 connections. These connections are needed by the 2 batch ETL examples.
Creating the Salesforce Connection
Here are the steps for the Salesforce connection:
1. ADD A NEW SALESFORCE CONNECTION
In Skyvia, click NEW. Then, click Connection.
2. CHOOSE SALESFORCE
- Type Salesforce in the Search box.
- Click the Salesforce icon.
3. CONFIGURE THE SALESFORCE CONNECTION
You need a valid Salesforce account or this step won’t work. So, check the configuration below.
- Name your connection.
- Select Production environment.
- Select OAuth 2.0.
- Click Sign-in with Salesforce.
- Click Test Connection.
- Click Save Connection.
You'll get a security key (an OAuth token) after signing in; then, Test Connection should succeed.
Do not uncheck the Use Bulk API checkbox in Advanced Settings. Keeping it checked makes Skyvia use Salesforce Bulk API calls, which helps you avoid reaching your API request limits and allocations.
Then, you’re ready to use this connection in a Skyvia package.
Creating the Azure SQL Database Connection
You need a valid Azure account to make this connection work. This also assumes that you have set up a fully managed PaaS database and a table where the Salesforce records will be loaded.
You also need to whitelist Skyvia’s IP addresses in Azure. One way to do it is by enabling public network access with firewall rules. See below.
You will know the Skyvia IP addresses once you register in Skyvia.
So, here are the steps:
1. CREATE A NEW AZURE SQL SERVER CONNECTION
Click NEW. Then, click Connection.
2. CHOOSE SQL SERVER
- Type SQL Server in the Search box.
- Click the SQL Server icon.
3. CONFIGURE THE AZURE SQL SERVER CONNECTION
You need to get the database credentials in Azure first. So, go to your Azure SQL database and click Connection Strings. You’re going to need the server’s name, User ID, and database name. See below.
Then, go to Skyvia and configure the SQL Server connection. Follow the instructions below using the Connection string information.
Test Connection will succeed if you use the correct credentials.
Batch ETL Example 1: Replicating Salesforce to Azure SQL Database
It's time to build our easy example – a Skyvia Replication Integration. The following shows Salesforce to Azure SQL Database replication.
- Name your Replication package.
- Specify “Salesforce” as the Source Connection.
- Specify “Azure-MSSQL” as the Target Connection.
- Select the Salesforce objects you want to replicate.
- Validate and Save your package.
Then, test it by clicking Run in the top-right corner of the page. Use external tools like SQL Server Management Studio to check the replicated data.
Then, you can schedule this package to run at a set time. It can be a one-time or recurring schedule. It’s all up to you. See below.
Batch ETL Example 2: Using Skyvia Data Flow to Load Salesforce Data to Azure SQL Database
We are going to use a simple example so this solution is easier to understand. And then you can use this knowledge to do complex Control Flows.
To do this, we will extract the Contacts from Salesforce. Then, add a new column as a transformation. And finally, store the results in Azure SQL.
We will use the dummy Contacts data created when you register to Salesforce. Check it out below.
So, we expect 7 names with initials present in Azure after we run the package.
Anyway, here are the steps:
1. CREATE A CONTROL FLOW
Click NEW. Then, click Control Flow.
2. NAME THE CONTROL FLOW AND DRAG 2 ACTIONS
- Name your Control Flow (in the screenshot below the name is Salesforce-Contacts-to-Azure).
- Drag and drop an Action Component between Start and Stop.
- Drag and drop a Data Flow between Action and Stop.
3. CONFIGURE THE ACTION COMPONENT
For simplicity, we will use a full load ETL to avoid duplicates. You can also do an incremental load in Skyvia but we won’t explain it here. So, the first action will remove all the records in the target table.
- Click the Action icon.
- Name the Action.
- Select the Azure-MSSQL connection.
- Select Execute Command for the Action.
- Specify the Command text.
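The full-load pattern this action implements can be sketched in Python. This is a hypothetical illustration: in-memory lists stand in for the Salesforce source and the Azure SQL target table, and the table contents are made up.

```python
# Hypothetical source and target tables.
salesforce_contacts = [{"name": "Rose Gonzalez"}, {"name": "Sean Forbes"}]
azure_table = [{"name": "Stale Row"}]  # leftovers from a previous run

# Step 1 (the Execute Command action): remove all records from the target,
# so a repeated run does not create duplicates.
azure_table.clear()

# Step 2 (the Data Flow): extract every source row and insert it into the target.
azure_table.extend(dict(row) for row in salesforce_contacts)
```

Clearing first and then reloading everything is what makes the package safe to rerun; an incremental load would instead track which source rows are new or changed.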
4. CONFIGURE THE DATA FLOW
- Click the Data Flow component.
- Click Open Data Flow. Then, a new page will appear with a blank Data Flow.
- Drag and drop a Source.
- Drag and drop an Extend transformation.
- Drag and drop a Target.
- Connect the arrows to each component.
Then, configure the Source as shown below:
- Click the Source icon.
- Name the Source.
- Select the Salesforce connection.
- Select Execute Query for the Action.
- Click Open in Editor. Find the Contacts table and click it.
- Drag the Contacts.* to Result Fields box, then click Apply.
Then, configure the Extend transformation:
- Click the Extend transformation icon.
- Name the Extend transformation.
- Click the pencil icon in Output Schema.
Then, you need to specify the expression for the new column:
- Click + to add a new column.
- Name the new column.
- Enter the expression to extract name initials using the left function.
- Click Apply.
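The logic of that expression can be shown as a hypothetical Python equivalent. The `left` helper below mirrors the left() string function the expression uses; the column names are illustrative and should match your real Salesforce fields.

```python
def left(text, n):
    # Mirrors the left() string function: the first n characters.
    return text[:n]

def initials(first_name, last_name):
    # The new column's value: the first letter of each name.
    return left(first_name, 1) + left(last_name, 1)

print(initials("Rose", "Gonzalez"))  # RG
```

In Skyvia, the same idea is written directly in the Extend transformation's expression editor rather than in code.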
Finally, configure the Azure Target:
- Click the Target icon.
- Name the Target.
- Select the Azure-MSSQL connection.
- Select Insert for the Action.
- Click dbo.[salesforce-contacts] or the name of your table in Azure.
- Click the pencil icon in Parameters. A pop-up window will appear.
- Click the Auto Mapping icon then click Apply.
- Click Create in the upper right corner to Save your work.
Then, click Run.
The result of the batch ETL is shown below. Compare it to the Salesforce screenshot earlier.
This simple Control Flow can be expanded to have more Data Flows handling other Salesforce tables. You can make a parallel execution or run it in succession.
And you can also make a schedule to run based on your needs. The configuration is the same as the Replication package earlier.
Batch ETL processing is easy if you have the right tools.
If you plan to do a batch ETL project, why not try it with Skyvia? It’s easy and flexible for most ETL needs.