In this article, we are going to learn about Amazon Redshift and how to work with CSV files. We will see some of the ways of data import into the Redshift cluster from S3 bucket as well as data export from Redshift to an S3 bucket. This article is written for beginners and users of intermediate level and assumes that you have some basic knowledge of AWS and Python.
Table of contents
- How to Export and Import CSV Files into Redshift in Different Ways
- How to Load CSV File into Amazon Redshift
- How to Unload CSV from Redshift
- Conclusion
How to Export and Import CSV Files into Redshift in Different Ways
Modern businesses tend to generate a lot of data every day. Once the data is generated, it is required to be stored and analyzed so that strategic business decisions can be made based on the insights gained. In today’s world, where more and more organizations are shifting their infrastructure to the cloud, Amazon Web Services, also known as AWS, provides a fully managed cloud data warehousing solution, which is Amazon Redshift.
Amazon Redshift is a fully managed data warehouse on the cloud. It supports Massively Parallel Processing Architecture (MPP), which allows users to process data parallelly. It allows users to load and transform data within Redshift and then make it available for the Business Intelligence tools.
CSV files are a very common and standard format of flat files in which columns and values are separated by a comma. Reading and storing data in CSV files are very simple, they have been used in the industry for over a few decades now. You can see a sample CSV file below.
In this article, you will learn various ways of data import/export from CSV to Redshift and vice versa.
How to Load CSV File into Amazon Redshift
Since CSV is one of the most popular forms of dealing with data in flat files, there are many tools and options to work with such CSV files. As such, there are different ways of how CSV files can be imported and exported from Redshift as well. You will learn about these methods in the later section as follows.
Load Data from Amazon S3 to Redshift, Using COPY Command
One of the most common ways to import data from a CSV to Redshift is by using the native COPY command. Redshift provides a COPY command using which you can directly import data from your flat files to your Redshift Data warehouse. For this, the CSV file needs to be stored within an S3 bucket in AWS. S3 is abbreviated for Simple Storage Service, where you can store any type of files. The following steps need to be performed in order to import data from a CSV to Redshift using the COPY command:
- Create the schema on Amazon Redshift.
- Load the CSV file to Amazon S3 bucket using AWS CLI or the web console.
- Import the CSV file to Redshift using the COPY command.
- Generate AWS Access and Secret Key in order to use the COPY command.
In the next section, you will see a few examples of using the Redshift COPY command.
REDSHIFT COPY COMMAND EXAMPLES
First you can create a cluster in Redshift and second create the schema as per your requirements. I will use the same sample CSV schema that you’ve seen in the previous section. In order to create the schema in Redshift, you can simply create a table with the following command.
The next step is to load data into an S3 bucket which can be done by either using the AWS CLI or the web console. If your file is large, you should consider using the AWS CLI.
Now when the CSV file is in S3, you can use the COPY command in Redshift to import the CSV file. Head over to your Redshift query window and type in the following command.
COPY table_name
FROM 'path_to_csv_in_s3'
credentials
'aws_access_key_id=YOUR_ACCESS_KEY;aws_secret_access_key=YOUR_ACCESS_SECRET_KEY'
CSV;
Once the COPY command has been executed successfully, you receive the output as in the above screen print. Now, you can query your data using a simple select statement as follows.
Sometimes, it might be that you do not want to import all the columns from the CSV file into your Redshift table. In that case, you can specify the columns while using the COPY command, and data only from those columns will be imported into Redshift.
As you can see in the above figure, you can explicitly mention names of the commands that need to be imported to the Redshift table.
REDSHIFT COPY COMMAND TO IGNORE HEADER FROM TABLE
Another important scenario while importing data from CSV to Redshift using the COPY command is that your CSV file might contain a header and you do not want to import it. In other words, you want to ignore the header from the CSV file from being imported into the Redshift table. In such a case, you need to add a specific parameter IGNOREHEADER to the COPY command and specify the number of lines to be ignored. Usually, if you just want to ignore the header, which is the first line of the CSV file, you need to provide the number as 1.
Auto Import Data into Amazon Redshift with Skyvia
Skyvia is a third-party cloud-based solution, which helps to automate data import from CSV to Amazon Redshift painlessly on a recurring basis. To start the process, simply sign up to the platform.
To accomplish the process in Skyvia, follow these 3 simple steps:
- Set up an Amazon Redshift connection;
- Configure data import and mapping settings between CSV file and Redshift;
- Schedule data migration
CONNECTION SETUP
Select Amazon Redshift among the list of data warehouses supported by Skyvia. In the opened Redshift connection window, enter the required parameters, which are Server, Port, User ID, Password and Database. You also need to click Advanced Settings and set parameters for connecting to Amazon S3 storage service. Among them are S3 region to use and either AWS Security Token or AWS Access Key ID and AWS Secret Key. Afterwards, check whether the connection is successful and click Create. You have completed the first step and connected to Amazon Redshift.
PACKAGE SETTINGS AND MAPPING
- Open an import package, select CSV as source and Redshift connection as target.
- Proceed with adding a task to the package. You are free to add as many tasks as you need. Skyvia allows performing several import tasks in one package and, thus, importing several CSV files to Redshift in a single import operation.
- In the task editor, upload a prepared CSV file. You are able to upload both CSV files from your PC or from a file storage service like Dropbox, Box, FTP, etc. As soon as you uploaded a CSV file, Skyvia displays a list of detected columns and allows you to explicitly specify column data types.
- Next, select an object in Redshift the data will be loaded to and choose an operation type.
- Columns with the same names in CSV and Redshift are mapped automatically. Map all other required source columns to target ones, using expressions, constants, lookups, etc. and save a task.
- In the package, you will see a saved task. Add another task in case you have another CSV file. Read more about CSV import to Redshift.
JOB AUTOMATION
Automate uninterrupted data movement from CSV to Redshift on a regular basis by setting schedule for your import package. Click Schedule and enter all required parameters in the Schedule window.
For the first time we recommend that you run your package manually to check how successful your package has been executed. If some of your columns in source and target are mapped incorrectly, you will see errors in your runs and will be able to update mapping settings. Moreover, Skyvia can send error notifications to your email.
Load Data from S3 to Redshift, Using Python
Python is one of the most popular programming languages in the modern data world. Almost every service on AWS is supported with the python framework, and you can easily build your integrations with it. We can use Python to build and connect to these services using libraries that are already available. In the following section, you will learn more about loading data from S3 to Redshift using python.
In order to be able to connect to Redshift using python, you need to use a library – “psycopg2”. This library can be installed by running the command as follows.
pip install psycopg2
Once the library is installed, you can start with your python program. You need to import the library into your program as follows and then prepare the connection object. The connection object is prepared by providing the hostname of the Redshift cluster, the port on which it is running, the name of the database and the credentials to connect to the database.
Once the connection is established, you can create a cursor that will be used while executing the query on the Redshift cluster.
In the next step, you need to provide the query that needs to be executed to load the data to Redshift from S3. This is the same query that you have executed on Redshift previously.
Once the query is prepared, the next step is to execute it. You can execute and commit the query by using the following commands:
cursor.execute(query)
conn.commit()
Now, you can go back to your Redshift cluster and check if the data has been copied from the S3 bucket to the Redshift cluster.
How to Unload CSV from Redshift
Like loading data from external files into Redshift, there is also an option to export data out of Redshift.
Export Data from Redshift, Using UNLOAD Command
Loding data out of Amazon Redshift can be done using UNLOAD command. You can simply select the data from Redshift and then provide a valid path to your S3 bucket to migrate data to. You can also filter the data in the select statement and then export your data as required. Once the query is ready, use the following command to unload data from Redshift to S3:
UNLOAD ('SELECT * FROM test.sample_csv')
TO 's3://csv-redshift-221/Unload_'
credentials 'aws_access_key_id=AKIA46SFIWN5JG7KM7O3;aws_secret_access_key=d4qfQNq4zYL39jcy4r4IWAxn4qPz4j8JgULvKa2d'
CSV;
Once the UNLOAD command is executed successfully, you can view the new file created under the S3 bucket.
The file is now available in the S3 bucket which can be downloaded and opened by any text editor.
Export Data from Redshift to CSV by Schedule, Using Skyvia
With Skyvia, you can export data from Redshift the same way as you imported data to it. For data migration from Redshift, sign in to Skyvia, open an export package, select Redshift as source, filter data you want to export, configure other package settings, create and run the package. Don’t forget to set a schedule for your package. Read more about Redshift export to CSV.
Conclusion
In this article, we’ve described several ways to import CSV to Redshift and vice versa. For those users who need data import/export from CSV on schedule, Skyvia will be of help. For more information, contact Skyvia support team.