With the rise of Big Data over the last decade, more and more businesses have been moving their applications to the cloud. It is often estimated that roughly 90% of the world's data has been generated in the last two years alone. As data volumes have grown, so has the velocity at which data is generated. As a result, new tools and applications are needed that are specifically designed to cope with this ever-increasing volume and velocity of data.
Data warehousing has long been a popular approach for generating insights and reporting on application data. A typical data warehouse is a database application specially designed to handle analytical queries without affecting the performance of the application database. Such a data warehouse can be hosted on-premises or with a public cloud vendor. However, a data warehouse application is quite resource-intensive and needs ongoing maintenance in order to perform well. Some cloud data warehouse providers also take over the responsibility of managing the data warehouse application, allowing users to focus on their data instead.
Snowflake is one such cloud data platform that allows users to generate data-driven insights across a multitude of workloads such as data warehousing, data lakes, and data engineering. To move data into a Snowflake data warehouse, there are several data integration tools (also known as ETL tools) available. In this article, we will focus on some of the popular Snowflake ETL tools and also discuss some key concepts to consider when doing ETL with Snowflake. Some of the key benefits of using Snowflake are listed below.
- Data backup and retention
- Dynamic computing capabilities
- A Pay-As-You-Go billing model
- Driver support for your favorite programming language
- Numerous cloud-centric synergies and integrations
TABLE OF CONTENTS
- Snowflake ETL Technology
- ETL Tools for Snowflake
- Skyvia
- Integrate.io
- Apache Airflow
- Matillion
- Stitch
- HevoData
Snowflake ETL Technology
Before diving deeper into Snowflake ETL concepts and Snowflake ETL best practices, let us first understand some key concepts. This will be helpful as we move through the article.
- ETL – Short for Extract, Transform, Load, this is the traditional technique for loading a data warehouse. In the Extract, Transform, Load workflow, data from the source system is first extracted, then transformed as per business requirements, and finally loaded into a target system from where it can be used for further downstream analysis.
- ELT – Extract, Load, Transform, on the other hand, is a slightly more modern technique that involves loading an exact copy of the source data into a staging or target location and then applying transformation logic on top of it. This makes the data available in the target system in both forms – raw and curated. This approach has become practical because modern cloud data warehouses offer far greater compute and processing power than traditional on-premises warehouses.
- Stage – A stage is a Snowflake object that points to a storage location (either internal to Snowflake or an external object store such as an S3 bucket) from which data files can be bulk-loaded, for example with the COPY INTO command or continuously via Snowpipe.
- Virtual Warehouse – The Snowflake compute resource used to process queries, load data, and perform other operations in Snowflake.
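To make the ETL-vs-ELT distinction concrete, here is a minimal Python sketch of the ELT pattern against Snowflake. The stage, table, and column names are illustrative assumptions, not objects from a real account; the statements would be executed through any Snowflake client, for example a cursor obtained from the snowflake-connector-python package.

```python
# Minimal ELT sketch for Snowflake. All object names (my_s3_stage,
# raw.orders, analytics.daily_orders) are hypothetical examples.

# Load step: copy the raw files from a stage into a landing table verbatim.
RAW_LOAD_SQL = """
COPY INTO raw.orders
FROM @my_s3_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""".strip()

# Transform step: build a curated table inside the warehouse, ELT-style,
# using Snowflake's own compute rather than an external processing engine.
TRANSFORM_SQL = """
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM raw.orders
GROUP BY order_date
""".strip()

def run_elt(cursor):
    """Run the load first, then the in-warehouse transform (the ELT order)."""
    cursor.execute(RAW_LOAD_SQL)
    cursor.execute(TRANSFORM_SQL)

if __name__ == "__main__":
    # Without a live Snowflake connection, just show the two statements.
    print(RAW_LOAD_SQL)
    print(TRANSFORM_SQL)
```

Note that both the raw table and the curated table end up living in the warehouse, which is exactly the dual raw-and-curated availability described above.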
In addition to these, there are certain Snowflake ETL best practices that you might consider while implementing your data pipelines.
ETL Tools for Snowflake
Now that we have an idea of the differences between the ETL and ELT concepts, let us understand how we can achieve them with Snowflake. Snowflake offers native tools for loading data from source systems, and there are also third-party tools that allow users to build custom integrations with Snowflake. I have listed some of the best ETL tools for Snowflake below; however, the list is not exhaustive.
Skyvia is a cloud-based data integration platform with different cloud integration services and tools for performing various data-related tasks. It is oriented primarily toward ETL and reverse ETL, but it also provides tools for API-based integration. Skyvia supports over 80 cloud applications and many of the most widely used databases.
- Supports reverse ETL if you need to load data from Snowflake into other systems.
- Extremely easy-to-use tools for ELT to Snowflake; no coding knowledge required.
- Large number of tools and products for different data-related use cases.
- Easy management of configured pipelines with sorting, filtering, search, etc.
- No web API to manage the pipelines themselves
- No phone support yet
Integrate.io is a no-code platform with a web interface that allows users to connect multiple database systems in the cloud. You can choose a source and a target data system from the list of available connectors. Once you have access to both the source and target data systems, you can create a mapping between the two and apply data transformations within the pipeline.
With Integrate.io, users can easily start building their data pipelines. The only prerequisite is to already have access to the source and target data systems. Once access has been configured, it takes just a few minutes to get data moving between the systems.
- A no-code platform, which means you do not need to write code for how the pipeline connects in the backend.
- User-friendly interface that allows customers to quickly get adapted.
- Can be integrated with a large number of data sources, including cloud and SaaS services.
- Sometimes it can be difficult to debug job failures.
- Jobs can turn out to be expensive if not properly optimized.
Apache Airflow is an open-source scheduling platform that allows users to programmatically author data pipelines, schedule them, and manage them in a distributed fashion. Airflow implements data pipelines as Directed Acyclic Graphs, also known as DAGs. A DAG comprises multiple individual tasks arranged so that dependencies between them can be established, and each task runs only after all of its upstream tasks have completed.
- It allows users to programmatically author data pipelines.
- Easy to manage and integrate with version control systems like Git.
- Very easy to build and manage data pipelines with Python.
- Open-source tool and can be installed on local servers or on the cloud.
- The tool itself is free of cost; however, costs depend on where it is deployed.
- Requires some intermediate knowledge of Python.
- There is a small learning curve involved in getting the ball rolling with Apache Airflow.
Matillion is a cloud-native data integration tool that provides a rich user interface for developing data pipelines. Matillion offers two products: Matillion Data Loader, which allows users to move data from various services into the cloud, and Matillion ETL, which allows users to define data transformations and build data pipelines in the cloud.
- There is a free version available that lets you load up to 1 million records.
- A large community is available for general support, and the user interface is great.
- You can deploy it to any public cloud, e.g., AWS using CloudFormation templates.
- Documentation can sometimes get a bit tricky.
- At the moment, there is no option to restart tasks from the point of failure; the entire job must be restarted instead.
Stitch offers its users an extensible data integration platform that can connect to a plethora of databases and other SaaS applications. Stitch provides an easy-to-use orchestration tool with which you can monitor your data pipelines on the go. Advanced and enterprise features include sending alerts to Slack, notifications to Datadog, and more. Stitch is extensible through its support for the open-source Singer project's taps and targets; with the help of the Singer project, you can build your own custom connectors as well.
- Provides extensibility by using open-source connectors.
- You can build your own integrations or get support from the community.
- Provides an out-of-the-box API to handle data ingestion with ease.
- To use Singer taps and targets, knowledge of Python is required.
- As a SaaS offering, there is no free tier; however, you can try it for free for 14 days.
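To illustrate what a Singer tap actually produces, here is a minimal sketch in Python. The stream name and fields are invented for the example, and a real tap would also implement schema discovery and read from an actual source; Singer messages are JSON objects written one per line to stdout, which a target then consumes.

```python
import json
import sys

# Minimal sketch of the Singer message stream a tap emits. The "users"
# stream and its fields are hypothetical examples.
def emit(message):
    # Singer messages are newline-delimited JSON written to stdout.
    sys.stdout.write(json.dumps(message) + "\n")

# SCHEMA message: describes the shape of the records that follow.
emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {"properties": {"id": {"type": "integer"},
                              "name": {"type": "string"}}},
    "key_properties": ["id"],
})

# RECORD message: one row of actual data.
emit({"type": "RECORD", "stream": "users",
      "record": {"id": 1, "name": "Alice"}})

# STATE message: a bookmark the target can persist for incremental runs.
emit({"type": "STATE", "value": {"users": {"last_id": 1}}})
```

Because taps and targets only communicate through this line-delimited JSON protocol, any tap can be piped into any target, which is what makes the ecosystem of custom connectors possible.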
HevoData is a SaaS offering that allows users to create data pipelines without writing a single line of code. It allows users to perform data extraction, transformation, and loading from multiple data sources. HevoData offers integrations with various platforms like Google Drive, Salesforce, Snowflake, AWS, etc. HevoData provides a useful feature to anonymize data before moving it to final destinations or data warehouses.
- It provides integration with a lot of modern applications and databases.
- Easy to deploy, and the support team is very helpful.
- There is no way to organize or categorize your data pipelines.
- Sometimes throughput is not up to the mark.
In this article, we have reviewed some of the popular ETL tools available for Snowflake. Now that we understand the concepts of ETL and ELT with Snowflake, it is easier to choose how to integrate with Snowflake. Along with the native data connectors, there are various third-party data integration tools that allow users to extract, load, and transform data into Snowflake.
If you are looking to implement your own data integration solution with Snowflake, I recommend checking out Skyvia's integration with Snowflake. It is one of the best Snowflake ETL tools, and it allows users to visually load data into Snowflake. To learn more about Skyvia, please visit the official documentation.