Table of contents
- Introduction to ETL with SSIS
- History of SSIS
- What Is the SSIS Package?
- Cons of SSIS
- Best SSIS Alternatives
- Azure Data Factory
- AWS Glue
Introduction to ETL with SSIS
The ETL is a generalized term for a process that does Extract-Transform-Load. Organization gets data from different sources and formats. It is not feasible to build an application that can consume data from any source or format. Therefore, the ETL tools do the following tasks.
- Provide a uniform interface for data sources and destinations.
- Apply transformation logic on the source data based on target data source requirements.
- Handle multiple data sources, formats in a single workflow.
- Require error handling logic so that they can eliminate bad data without any user intervention.
- Perform data transformation such as sorting, filtering, grouping aggregation, cleaning data, removing duplicates.
What Is SSIS (SQL Server Integration Services)?
The SQL Server Integration Services, aka SSIS, is an ETL tool that can embed complex business logic, load data from various data sources into SQL Server, Sharepoint List destination, flat files, etc. It is a part of the Microsoft SQL Server database software family. The SSIS contains workflows for performing the task in serial or parallel mode. It can create constraints for different execution paths based on the task success, failure, or completion.
History of SSIS
It has a long history of improvements over time in terms of features, reliability, speed, and stability. Many of us can testify this because we managed to use the Data Transformation Services (DTS) in SQL Server 7 before it became SSIS.
- SQL Server 7 and 2000: Microsoft introduced Data Transformation Services (DTS) in SQL Server version 7 and 2000.
- SQL Server 2005: SQL Server 2005 got Integration Services (SSIS), a revamp of earlier DTS with new looks and enhanced features.
- SQL Server 2008: In this version, SSIS has more data sources and performance improvements for handling more significant amounts of data with minimum resources.
- SQL Server 2012: It is a milestone for SSIS as it includes many concepts such as the project deployment model. The project deployment model consists of a project file in the ISPAC extension, a self-contained deployment unit, and information about ETL packages with parameters stored in the project.
- SQL Server 2014: This version includes new data transformation tasks, source and destination connections using an additional SQL Server feature pack.
- SQL Server 2016: The SQL Server 2016 version includes an Incremental Package Deployment feature, which allows you to deploy one or multiple packages instead of deploying the whole project. Earlier versions required full package deployment only. Note: Refer to Microsoft docs for detailed information on the new features in SSIS 2016.
- SQL Server 2017: This version includes scales-out for distributing SSIS package execution across multiple worker computers, Integration Services on Linux, connecting to the OData feeds for Microsoft Dynamics AX Online. Note: Refer to What’s New in Integration Services in SQL Server 2017 for more details.
- SQL Server 2019: This version has improved file operations such as Perform file operations on Local File System, Azure Blob Storage, and ADLS (Azure Data Lake Storage) Gen2. Note: Refer to What’s new in SQL Server 2019 (15. x) for more details.
What Is the SSIS Package?
The SSIS package is a collection of the necessary components for a specific ETL task. These components are the following:
- Source and destination connections
- Control flow
- Data flow
- Variables and parameters
- Event handler
The below diagram depicts the high level of ETL packages in SSIS.
Key features of SSIS
- Source and destination connections: The SSIS consists of multiple in-built sources and destination connectors. These connectors can connect to the flat files (TXT, CSV, XLS), Relational databases (SQL Server, Oracle, MySQL, etc.), OLEDB, ODBC, Azure Connections, WMI, SMO, FTP, Mail servers etc. You can use connectors such as SQL Server Destination instead of OLE DB to load the data faster.
- SSIS transformation tasks: The SSIS has multiple data transformation tasks such as Row transformation, Split and Join transformations, Aggregate transformation, Data Conversion Transformation, Conditional Split Transformation, Cache Transform Transformations, Auditing Transformations, Fuzzy Grouping, and Lookup SSIS Transformations and Custom Transformations with quick configurations and deployment.
- Graphical interfaces and workflow: The SSIS is a workflow-oriented tool that defines control and data flow tasks using precedence constraints. It includes a friendly GUI tool using SQL Server Data Tools to define tasks without any difficulty. You can also monitor execution and view central logs for all packages stored in SQL Server.
- Package deployment and reuse: The SSIS allows a package deployment model for deploying ETLs across multiple servers, migrating them, and reuse them as required.
- Comprehensive documentation and support.
Cons of SSIS
- You cannot develop SSIS packages on Linux. But you can deploy them on SQL Server 2019 for Linux and run them from there. However, there will be certain features not available on Linux.
- The SSIS requires developer experience, complex coding for applying business logic. Except for cases when you need simple loading and transformation of CSV and writing rows in SQL Server.
- It requires Microsoft SQL Server Management Studio for viewing logs or SQL Server Data Tools for viewing execution logic. It isn’t easy to fetch information using the command-line tools.
- It requires a complex transformation process for the JSON data.
- The SSIS memory consumption can be quite high when working with multiple packages in parallel.
Best SSIS Alternatives
There are various SSIS alternatives that you can use for building ETL packages. Among such ETL tools are the following:
- Talend ETL
- Azure Data Factory
- AWS Glue
Skyvia is a cloud ETL and ELT data integration tool that is compatible with both on-premises and cloud sources. Skyvia supports a wide variety of cloud apps, databases and data warehouses.
Skyvia offers several advanced data integration tools described shortly below.
- For performing ETL operations, Skyvia offers an Import tool with advanced mapping settings for data transformations. It preserves all relations between the imported files, tables, or objects; supports INSERT, UPSERT, UPDATE and DELETE opeartions; offers detailed logging, scheduling, and error handling functionality.
- For performing ELT operations, Skyvia offers a Replication tool. Skyvia Replication helps to extract data from a source and load it into target as is. Unlike Import tool, Replication does not support loading data of different structure in source and target or using any custom mapping. However, it is much simpler to configure. It does not require a database with a prepared schema — it can create database tables for cloud data automatically.
- For more complex scenarios, Skyvia offers Data Flow. It helps to design complex data pipelines involving multiple data sources. It combines data from several sources, loads data to multiple destinations, perfoming advanced multistage transformations on the way to destinations. Plus you have a visual diagram of the whole process, which is quite convenient.
Talend is an open-source ETL data integration tool that is compatible with both on-premises and cloud sources.
- It has a user-friendly interface with a straightforward design of jobs.
- Talend has streamlined data warehouse management in comparison with SSIS ETL.
- It runs on the Java platform and utilizes Java for advanced scripting components.
- It utilizes JDBC connections to database systems and provides better error management.
- Talend uses metadata to store the connection strings for XML, Excel, Web Services, FTP, etc.
- In the Talend, we define ETL as jobs, while SSIS defines them as ETL packages.
- Talend offers scheduling and logs and error services similar to the SSIS package.
Python is an open-source machine language that consists of a considerable number of its packages for executing tasks with minimum coding. Python is a code-based language to define your logic and utilize the frameworks for performing ETL tasks compared with the SSIS packages. However, you require coding experience to use it for various data migrations.
Azure Data Factory
- Azure Data Factory is a browser-based ETL having robust scheduling and monitoring features.
- You do not need to manage infrastructure for the Azure Data Factory, and you can start building your ETLs as pay-as-you-go subscriptions.
- ADF supports aggregations, fuzzy lookups, derived columns, and other visually designed data transformations, similar to SSIS.
- The ADF data flows and databricks have spark clusters that transform and process analytics workloads.
- You can migrate your SSIS packages in the Azure Data Factory or execute them using integration service runtime.
- Azure Data Factory supports event-based, tumbling window triggers, scheduled batch triggers.
- It supports over 90+ in-built sources for processing complex data requirements. You can use third-party or customer connectors for processing JSON, REST APIs.
- AWS Glue is Amazon Web Services’ fully managed ETL for big data and data analysis.
- It is also a serverless data integration service having capabilities required for data integration from various data sources.
- It offers both visual and code-based interfaces for both beginners and experienced data professionals.
- AWS glue has automatic crawling functionality that identifies data formats and suggests schema tables for storing your data.
- You do not need to manage any servers in the AWS glue serverless architecture.
- It supports event-driven ETLs so that your process can automatically run in case of a specific event like the arrival of new data files.
- AWS Glue supports unified catalogs that help you find relevant data from multiple data sources in a single pane.
Each ETL tool is different in terms of its functionality, data sources, transformations, and visualization. You should compare the products as per your feature requirements and accessibility.
I would recommend you to explore the Skyvia data integration tool that fits most of the requirements. It offers a free plan for you to explore the product, and it includes 5k records per month, +100k CSV export and import, scheduling once a day and 2 scheduled packages. Visit Skyvia for more details.