Check our new tutorial about ETL tools for Amazon Web Services (AWS). Get examples and best practices for using ETL in AWS.
Table of contents
- AWS Services List
- ETL in Amazon AWS
- AWS Glue ETL Advantages and Disadvantages
- Top 5 ETL tools for AWS
As the adoption of cloud computing began, more and more organizations are now looking forward to migrating their existing platforms to the cloud or building new infrastructure directly on the cloud. This allows businesses to focus more on their applications instead of managing the underlying infrastructure and resources. Another key point of building cloud applications is cost, where users benefit from a pay-as-you-go model, which in turn allows them to quickly scale up and down based on the traffic encountered without having to pay up-front.
ETL tools also came along the same way. There are various cloud-based companies that offer ETL tools in the cloud such as AWS, Azure, GCP, etc. In addition to these, there are also other companies that offer SaaS ETL tools like Skyvia, HevoData, Stitch, etc. Skyvia provides a universal cloud data platform that allows users to integrate data from various cloud platforms and bring them to a centralized data warehouse in Amazon. HevoData allows users to build data pipelines on the cloud, whereas Stitch offers users to build custom data integration plugins with its open-source Singer tap and targets concept.
There are pros and cons behind each of those services and tools. In this article, we will focus on what ETL in AWS means, different ways how to load data and insert it into Amazon Redshift schema, and also explore how you can benefit from them. We will also discuss briefly the various types of ETL-related services and what should be your choice based on a use case.
AWS Services List
AWS provides various services that customers can use to design, develop and run their applications. These services range from infrastructure such as provisioning an EC2 instance to fully-managed applications like S3, where you do not have to provision anything. As of writing this article, there are over 200 services that Amazon provides, however, let’s discuss a few of the most relevant ones.
AWS Redshift — Amazon Redshift is a fully-managed, petabyte-scale cloud data warehouse that enables data professionals to query large volumes of data in a very efficient way. It is an enterprise-grade data warehouse solution that stores data in a column-oriented pattern, useful for querying over large aggregates. You can use Amazon Redshift ETL to copy from multiple s3 files and query your data using Redshift Spectrum. Amazon Redshift runs on a cluster to distribute its computation into a leader and compute nodes. Since Redshift is based on an MPP (Massively Parallel Processing) architecture, data transfer from S3 into Redshift is executed in parallel and often quite fast. It offers two types of scaling possibilities — Classic Resize and Elastic Resize. While Redshift supports cloud data warehousing workloads, it offers two categories for pricing — Dense Compute and Dense Storage. Based on your requirements, you can choose either one of them to proceed with. Amazon Redshift also provides a bulk load feature using which you can load files multiple files from S3 to Redshift ETL.
AWS Simple Storage Service (S3) — Amazon S3, abbreviated as Simple Storage Service is a low-cost object storage system that allows customers to store files in the cloud. This is one of the oldest services provided by AWS and it is quite popular due to its ease of use and simplicity. It allows developers to store and retrieve files from a highly scalable and reliable data store. Amazon S3 is billed by the amount of data you store, i.e. you pay for only what you use. This makes the S3 one of the cheapest solutions to store data on the cloud. To learn more about Amazon S3, please refer to the official documentation.
AWS Elastic Compute Cloud (EC2) — Amazon Elastic Compute Cloud, also known as EC2 is a compute service provided by Amazon that allows users to spin up virtual machines on the cloud. It allows customers to run Windows, Linux, or MacOS virtual machines with resizable compute and memory capacities. EC2 instances can be scaled up and down and can also be launched in one or more regions across the globe. It allows developers to build highly resilient and fault-tolerant applications by leveraging Availability Zones within the regions.
AWS Relational Database Service (RDS) — As the name suggests, Amazon RDS allows users to run fully managed databases on the cloud. RDS currently allows users to run multiple database engines such as MySQL, PostgreSQL, MariaDB, SQL Server, and Oracle. It also has its proprietary database engine Aurora which is highly scalable and resilient. You can use RDS to build database applications and scale them up and down based on your workloads. You can also deploy multi-node clusters of databases that allow your application to be fault-tolerant in case one of your nodes go down. Amazon ETL RDS can be used to extract and load data into your RDS instances from Amazon S3 buckets.
AWS Lambda — AWS Lambda is a serverless compute service that allows users to write and execute custom functions on the go. As a developer, you can just write your code and submit it to AWS Lambda and it will be executed as-is. It is a great tool, especially when developing microservices architecture as you can build a lambda function for each of your services and design a loosely coupled architecture.
ETL in Amazon AWS
ETL, abbreviated as Extract Transformation and Load is a method of extracting data from source systems, applying transformation logic to the extracted data, and then loading it into a target data system. Traditionally, ETL has been performed using on-premise ETL applications and data warehouses. However, with the rise of cloud computing, more and more ETL workloads are being shifted to the cloud. A key advantage of starting ETL on the cloud is scalability and cheaper solutions as compared to their on-premise counterparts.
ETL in AWS can be performed by using a combination of services. You can plug in multiple services with one another and then perform extract transform and loading. When it comes to ETL in Amazon, AWS Glue is the go-to tool of choice. AWS Glue is a fully-managed data integration platform that allows users to write custom ETL scripts. You can choose to use AWS Glue to load and manipulate data within your AWS environment, trigger event-driven data pipelines by leveraging AWS Lambda, build a data catalog that stores metadata of multiple data sources in the AWS ecosystem, or design ETL pipelines visually without writing any code. You can also build materialized views that let you combine data from multiple sources and access it from a target data store. Let’s take a look at how to achieve ETL in AWS.
The above figure depicts the GUI environment from AWS Glue where you can design your ETL workflows graphically using nodes and link them together.
AWS Glue ETL Advantages and Disadvantages
Every tool comes with some pros and cons and AWS Glue is no different. While it offers amazing ETL capabilities in the cloud, let’s look at some of the pros and cons of this tool.
Advantages of Amazon Glue ETL Tool
- Serverless ETL — AWS Glue is a serverless ETL platform. This means developers do not have to manage any server or infrastructure before running an ETL job in Glue. Begin serverless, also implies that the service is going to be relatively cheaper than a traditional ETL tool.
- GUI support — AWS Glue has a GUI that allows developers and other users to build data transformation pipelines without having to code. This is helpful as someone who doesn’t have a fair knowledge of Python or PySpark can also easily design a basic Glue job from scratch. Glue automatically generates PySpark code in the background as you start designing your job in the GUI.
- Integration with other AWS Services Since AWS Glue is a native service from AWS, it offers better integration with other services within the ecosystem, such as Amazon S3, RDS, Redshift, etc.
Disadvantages of AWS Glue
- No control of resources — Amazon Glue offers only three types of instances — Standard, G.1X (Memory Intensive), and G.2X (Machine Learning). There is no option to tweak your memory or CPU requirements and tailor it to your needs.
- Only supports Python and Scala — At the moment of writing this article, Amazon Glue only supports Python and Scala scripts. If your company already has ETL jobs written in some other programming language, then it is going to be a lot of effort to migrate them to python and then deploy them to AWS Glue.
- Steep Learning Curve — Since AWS Glue is a native service from AWS, it offers better integration with other services within the ecosystem, such as Amazon S3, RDS, Redshift, etc.
- Compatibility Issues — AWS Glue has good support for data sources within the AWS ecosystem, however, if you have your data on non-AWS systems, there will be limited support and you need to somehow bring in your data in AWS first before running a Glue job.
Top 5 ETL tools for AWS
Designing your ETL pipeline and choosing the tool is always a matter of concern for data architects. There are different options available that suit a variety of workloads. AWS Glue is the native way to perform ETL workloads in AWS, however, in this section, we will discuss a list of Amazon Redshift ETL tools that you can use with AWS. Some of the third-party ETL tools provide a few benefits that are not available with AWS Glue.
Skyvia is a cloud data platform with a number of GUI tools for different data-related tasks. It includes tools for ELT and ETL scenarios and can load data to and from AWS. It supports Amazon Redshift, Amazon RDS, and Amazon S3 services as well as many other, non-Amazon cloud apps and databases. Skyvia suits both for simple cases when you need to just copy your data to Amazon Redshift for data analysis or perform Amazon CSV export and advanced scenarios with complex data pipelines with transformations, involving multiple data sources. You can use it for one-time data loading jobs as well as for periodic data sync.
Fivetran is a cloud-based ETL or ELT tool using which you can extract, transform and load your data into the AWS environment. A use case of using Fivetran would be when your data source is outside the AWS ecosystem, you can connect to the data source using the connectors provided by Fivetran and get your data loaded. As one of the best data pipeline tools in the market, Fivetran allows scheduling, error discovery, incremental data loading and routing. As a SaaS offering, you no longer need to install any piece of software on your machine to get started with Fivetran.
Amazon Kinesis Data Streams is one the best ETL tools for AWS as it allows you to ingest and process a huge amount of data in real-time. You can build real-time data solutions using Kinesis Streams and provide live data into your data warehouse. Amazon Kinesis reads data from a stream such as a Kafka topic or an SQS queue. As soon as the data is available it reads, processes, and sends it to Redshift or other services as defined.
As discussed in the previous section, AWS Glue is a serverless ETL platform provided by AWS. You can use it to build and run PySpark jobs at ease without having to provision any physical infrastructure or cluster. Glue also provides a drag-and-drop interface that allows you to design your ETL job without writing a single line of code. AWS Glue is best suited for use cases when your data is already available within the AWS ecosystem such as an S3 bucket and you would want to use it to transform and load data to other systems such as Amazon RDS or Amazon Redshift.
AWS Data Pipeline
AWS Data Pipeline is another addition to the list of ETL tools from Amazon. It allows users to design data pipelines using a simple user interface that allows users to drag and drop various source and target nodes into a canvas and define connection properties. It is a batch pipeline and can be used to move data between various services within the AWS ecosystem as well as from on-premise locations. Since it is a batch data processing tool, it is best suited for workloads that do not have real-time reporting requirements. You can use it to move data from an S3 bucket to RDS or between multiple S3 buckets.
These are some of the best ETL tools for AWS, however, depending on your choice, you might want to evaluate the one that suits all your needs. While designing data pipelines it is important to follow some of the best practices that will help you scale your data pipelines and be cost-efficient.