Introduction
Amazon S3 is the primary storage repository for raw, semi-structured, or structured data in modern cloud environments. Meanwhile, Amazon RDS for PostgreSQL is a reliable, fully managed relational database system for transactional and analytical operations. Moving data from Amazon S3 to Amazon RDS PostgreSQL is routine for data engineers, especially for use cases like log ingestion, ETL pipeline outputs, or report generation.
AWS provides the aws_s3 extension in RDS PostgreSQL to simplify this data movement, enabling direct data import from Amazon S3 into a database table using a SQL command. This blog will walk you through the setup, usage, and best practices for leveraging the aws_s3.table_import_from_s3 function.
Prerequisites
Before beginning the import process, ensure the following conditions are met:
- Amazon RDS for PostgreSQL: An instance running PostgreSQL version 11 or higher.
- Amazon S3 Bucket: The source file (CSV format preferred) must exist in a bucket in the same AWS region as your RDS instance.
- AWS IAM Role: Create an AWS IAM role with the AmazonS3ReadOnlyAccess policy to allow access to the Amazon S3 bucket.
- Role Association: Attach the AWS IAM role to your Amazon RDS instance for the S3_IMPORT feature.
- Extension Enabled: Enable the aws_s3 extension on your PostgreSQL database.
Creating and Associating an AWS IAM Role
For Amazon RDS to read from your Amazon S3 bucket, you will need an AWS IAM role configured with appropriate permissions.
Step 1: Create a new AWS IAM role with the following trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "rds.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Step 2: Assign the AmazonS3ReadOnlyAccess policy to the created AWS IAM role.
Step 3: Use the AWS CLI or Console to link the AWS IAM role to your Amazon RDS instance:
aws rds add-role-to-db-instance \
    --db-instance-identifier <your-db-instance> \
    --role-arn arn:aws:iam::<account-id>:role/<role-name> \
    --feature-name S3_IMPORT
This grants your Amazon RDS instance permission to fetch files from the Amazon S3 bucket.
Enabling the aws_s3 Extension
Connect to your Amazon RDS PostgreSQL instance using a SQL client and run the following command to enable the extension:
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
The CASCADE option also installs the required aws_commons extension, which provides helper functions such as aws_commons.create_s3_uri used during import.
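To confirm that both extensions are installed, you can query the pg_extension catalog:

SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('aws_s3', 'aws_commons');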
Creating the Destination Table
Before importing, create a table that matches the schema of your source data. For example:
CREATE TABLE users (
    id    INT,
    name  TEXT,
    email TEXT
);
Important: The order of columns and their data types should match the CSV file exactly to avoid loading errors.
Importing Data Using aws_s3.table_import_from_s3
Here’s how to perform the import:
SELECT aws_s3.table_import_from_s3(
    'users',
    '',
    '(format csv, header true)',
    aws_commons.create_s3_uri('your-bucket-name', 'path/to/file.csv', 'us-west-2')
);
Parameters Breakdown:
- 'users': The destination table.
- '' (empty string): All columns of the destination table are used, and the file's columns are mapped to them in order (an example with an explicit column list follows below).
- '(format csv, header true)': Options passed to the PostgreSQL COPY command; here, a CSV file with a header row.
- aws_commons.create_s3_uri(...): Builds the structure identifying the Amazon S3 bucket name, object key, and AWS Region.
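If the file contains only a subset of the table's columns, you can name the target columns explicitly. The following is a minimal sketch; the column names and object key are illustrative and assume the users table shown earlier:

SELECT aws_s3.table_import_from_s3(
    'users',
    'id, email',                     -- load only these columns; the rest stay NULL or default
    '(format csv, header true)',
    aws_commons.create_s3_uri('your-bucket-name', 'path/to/ids_and_emails.csv', 'us-west-2')
);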
Best Practices & Tips
Follow these tips to ensure smooth imports:
- Use Staging Tables: Import into temporary or staging tables first for data validation before moving rows into production tables (see the sketch after this list).
- Maintain Format Consistency: Ensure your CSV files are consistently formatted with matching headers and delimiters.
- Split Large Files: Divide huge datasets into smaller chunks to simplify troubleshooting.
- Schedule Wisely: Run imports during non-peak hours to minimize locking and performance issues.
- Enable Query Logging: Monitor slow queries and errors by enabling detailed PostgreSQL logs.
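A minimal sketch of the staging-table pattern, assuming the users table from earlier; the staging table name and the validation rule are illustrative:

-- Load into a staging table that mirrors the target schema
CREATE TABLE users_staging (LIKE users);

SELECT aws_s3.table_import_from_s3(
    'users_staging',
    '',
    '(format csv, header true)',
    aws_commons.create_s3_uri('your-bucket-name', 'path/to/file.csv', 'us-west-2')
);

-- Promote only rows that pass basic validation
INSERT INTO users
SELECT * FROM users_staging
WHERE email IS NOT NULL;

DROP TABLE users_staging;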
Real-World Use Case Example
Suppose you work at a logistics firm where daily shipment logs are saved in an Amazon S3 bucket. Each CSV contains shipment IDs, locations, and timestamps. You can set up a nightly import job into your Amazon RDS PostgreSQL reporting database like this:
SELECT aws_s3.table_import_from_s3(
    'shipments',
    '',
    '(format csv, header true)',
    aws_commons.create_s3_uri('my-logistics-bucket', 'daily_logs/2024-04-22.csv', 'us-east-1')
);
This approach minimizes manual effort, saves transformation tool costs, and delivers timely data insights.
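For reference, the destination table for such a feed might look like the sketch below; the column names and types are assumptions based on the fields described above:

CREATE TABLE shipments (
    shipment_id TEXT,
    location    TEXT,
    event_time  TIMESTAMPTZ
);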
Performance Considerations
When dealing with large datasets, keep these performance tricks in mind:
- Disable Indexes: Drop or disable indexes during imports and recreate them afterward to speed up loading (a sketch follows this list).
- Parallel Loads: Use multiple temporary tables to load parts of your data in parallel.
- CloudWatch Monitoring: Monitor CPU, memory, and disk metrics to catch bottlenecks early.
- Data Validation: Always validate your imported data to catch any anomalies early.
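A minimal sketch of the drop-and-rebuild pattern, assuming a hypothetical index on the users table's email column:

-- Drop the index so the load does not pay index-maintenance costs
DROP INDEX IF EXISTS idx_users_email;

SELECT aws_s3.table_import_from_s3(
    'users',
    '',
    '(format csv, header true)',
    aws_commons.create_s3_uri('your-bucket-name', 'path/to/file.csv', 'us-west-2')
);

-- Rebuild the index once the data is loaded
CREATE INDEX idx_users_email ON users (email);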
When to Use vs. Other AWS Services
Use aws_s3.table_import_from_s3 when:
- Your source data is already in S3.
- You don’t need a complex transformation.
- You prefer a lightweight, serverless solution.
Use AWS Glue or AWS DMS when:
- You need advanced transformations.
- You’re syncing data between heterogeneous databases.
- Your import involves large-scale orchestration.
Conclusion
With a basic setup of enabling the aws_s3 extension, associating an AWS IAM role, and preparing your destination table, you can automate and streamline data transfer from Amazon S3 into Amazon RDS for PostgreSQL without building a separate ETL pipeline.
Drop a query if you have any questions regarding Amazon RDS PostgreSQL and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner and many more.
FAQs
1. Can I use aws_s3.table_import_from_s3 with file formats other than CSV?
ANS: – The function passes its options string to the PostgreSQL COPY command, so it supports the formats that COPY understands, such as CSV and plain text. For JSON, Parquet, or other formats, convert the files to CSV before importing, or consider AWS Glue for more flexible ETL operations.
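As an illustration, a tab-delimited text file could be imported by passing the corresponding COPY options; the object key below is hypothetical:

SELECT aws_s3.table_import_from_s3(
    'users',
    '',
    '(format text)',                 -- COPY text format uses tab as the default delimiter
    aws_commons.create_s3_uri('your-bucket-name', 'path/to/file.tsv', 'us-west-2')
);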
2. Is it possible to automate the import process on a schedule?
ANS: – Yes. You can automate the import using AWS Lambda and Amazon EventBridge (formerly Amazon CloudWatch Events). Create a scheduled rule that triggers an AWS Lambda function, which connects to your Amazon RDS for PostgreSQL instance and executes the import SQL statement at the desired interval (e.g., daily or hourly).
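The SQL such a scheduled job executes can build the object key from the current date. The sketch below reuses the logistics example; the bucket name and key pattern are assumptions:

SELECT aws_s3.table_import_from_s3(
    'shipments',
    '',
    '(format csv, header true)',
    aws_commons.create_s3_uri(
        'my-logistics-bucket',
        'daily_logs/' || to_char(current_date, 'YYYY-MM-DD') || '.csv',   -- today's file
        'us-east-1'
    )
);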

WRITTEN BY Sunil H G
Sunil H G is a highly skilled and motivated Research Associate at CloudThat. He is an expert in working with popular data analysis and visualization libraries such as Pandas, Numpy, Matplotlib, and Seaborn. He has a strong background in data science and can effectively communicate complex data insights to both technical and non-technical audiences. Sunil's dedication to continuous learning, problem-solving skills, and passion for data-driven solutions make him a valuable asset to any team.