Overview
Efficiently managing database growth is crucial for optimizing costs and maintaining performance. As data accumulates in Amazon RDS, archiving old or less frequently accessed data to Amazon S3 offers a cost-effective solution. Organizations can build a scalable and seamless archival pipeline by leveraging AWS Lambda for automation.
This blog provides a detailed guide to automate the data archival process, covering architecture, implementation, and best practices.
Why Automate Data Archival?
- Cost Efficiency – Amazon S3 offers lower storage costs compared to Amazon RDS. Archiving data reduces operational expenses.
- Enhanced Database Performance – Removing unused data minimizes query times, backup durations, and resource consumption.
- Regulatory Compliance – Facilitates adherence to data retention policies by securely storing archival data.
- Scalability – Automation handles large datasets without manual intervention, reducing operational overhead.
Architecture Overview
The proposed solution involves the following AWS services:
- Amazon RDS: The source database containing data to be archived.
- Amazon S3: Destination for archived data.
- AWS Lambda: Executes the data extraction and transfer logic.
- Amazon CloudWatch Events: Schedules the archival process.
- AWS Identity and Access Management (IAM): Secures access for AWS Lambda to interact with Amazon RDS and Amazon S3.
Workflow
- Amazon CloudWatch Events triggers the AWS Lambda function on a schedule.
- AWS Lambda connects to the Amazon RDS instance and queries records flagged for archival.
- The data is extracted, transformed into a CSV file, and uploaded to an Amazon S3 bucket.
- Once the upload succeeds, the corresponding rows are deleted from Amazon RDS.
Step-by-Step Implementation
Step 1: Set Up Amazon RDS
1. Create or Identify the Amazon RDS Instance
- Ensure the Amazon RDS database has sufficient resources to handle data archival queries.
2. Add an Archive Flag
- Modify your database schema to include a column (e.g., archive_flag) for marking rows eligible for archival.
- Example SQL:
ALTER TABLE transactions ADD archive_flag BOOLEAN DEFAULT FALSE;
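With the column in place, rows can be flagged on whatever criterion fits your retention policy. Here is a minimal sketch, assuming a created_at timestamp column and a 90-day cutoff (both are assumptions; adjust them to your schema):

import os
import pymysql

# Flag transactions older than 90 days for archival
# (assumes a created_at column; adjust to your retention policy).
connection = pymysql.connect(
    host=os.environ["RDS_HOST"],
    user=os.environ["RDS_USER"],
    password=os.environ["RDS_PASSWORD"],
    database=os.environ["RDS_DATABASE"],
)
with connection.cursor() as cursor:
    cursor.execute(
        "UPDATE transactions SET archive_flag = TRUE "
        "WHERE created_at < NOW() - INTERVAL 90 DAY"
    )
connection.commit()
connection.close()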
Step 2: Create an Amazon S3 Bucket
1. Set Up the Bucket
- Use the Amazon S3 console to create a bucket (e.g., my-data-archive-bucket).
- Enable server-side encryption to secure archived data.
2. Configure Lifecycle Policies
- Define policies to transition old data to cheaper storage tiers like Amazon S3 Glacier.
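Lifecycle rules can be defined in the console or scripted. A minimal boto3 sketch that transitions archive objects to Amazon S3 Glacier after 90 days (the bucket name and the 90-day threshold are assumptions):

import boto3

s3 = boto3.client("s3")

# Transition every object in the archive bucket to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-archive-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},   # apply to all objects in the bucket
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)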
Step 3: Write the AWS Lambda Function
Install Dependencies
Use the AWS Lambda Python runtime and package the required libraries, pymysql and boto3, with your deployment. (boto3 already ships with the Lambda Python runtime, but bundling it pins a known version.)
pip install pymysql -t ./lambda_package
pip install boto3 -t ./lambda_package
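Your handler file (e.g., lambda_function.py) goes into the same directory, and the whole directory is zipped for upload. A sketch using Python's standard library:

import shutil

# Produces lambda_package.zip from the directory contents,
# ready to upload via the Lambda console or AWS CLI.
shutil.make_archive("lambda_package", "zip", "./lambda_package")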
AWS Lambda Code Example
Here is the Python code for the AWS Lambda function:
import boto3
import pymysql
import csv
import os
from datetime import datetime

# Environment Variables
RDS_HOST = os.environ['RDS_HOST']
RDS_USER = os.environ['RDS_USER']
RDS_PASSWORD = os.environ['RDS_PASSWORD']
RDS_DATABASE = os.environ['RDS_DATABASE']
S3_BUCKET = os.environ['S3_BUCKET']
ARCHIVE_TABLE = os.environ['ARCHIVE_TABLE']

def lambda_handler(event, context):
    connection = None
    cursor = None
    try:
        # Connect to the RDS instance
        connection = pymysql.connect(
            host=RDS_HOST,
            user=RDS_USER,
            password=RDS_PASSWORD,
            database=RDS_DATABASE
        )
        cursor = connection.cursor()

        # Query records for archival (table name comes from a trusted env var)
        query = f"SELECT * FROM {ARCHIVE_TABLE} WHERE archive_flag = TRUE"
        cursor.execute(query)
        rows = cursor.fetchall()

        if not rows:
            print("No records found for archival.")
            return

        # Save data to a CSV file in Lambda's writable /tmp directory
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        file_name = f"{ARCHIVE_TABLE}_archive_{timestamp}.csv"
        file_path = f"/tmp/{file_name}"

        with open(file_path, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            # Write header
            writer.writerow([desc[0] for desc in cursor.description])
            # Write data rows
            writer.writerows(rows)

        # Upload the file to S3
        s3_client = boto3.client('s3')
        s3_client.upload_file(file_path, S3_BUCKET, file_name)
        print(f"Archived data uploaded to S3: {file_name}")

        # Delete archived records only after the upload succeeds
        delete_query = f"DELETE FROM {ARCHIVE_TABLE} WHERE archive_flag = TRUE"
        cursor.execute(delete_query)
        connection.commit()
        print("Archived records deleted from RDS.")

    except Exception as e:
        print(f"Error occurred: {str(e)}")
    finally:
        # Guard the cleanup: the connection may never have been established
        if cursor:
            cursor.close()
        if connection:
            connection.close()
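Before wiring up the schedule, you can smoke-test the handler locally against a test database. A minimal sketch, assuming the code above is saved as lambda_function.py and that the placeholder values below point at non-production resources:

import os

# Placeholder values for a test environment; the module reads them at
# import time, so they must be set before lambda_function is imported.
os.environ.update({
    "RDS_HOST": "test-db.example.com",
    "RDS_USER": "testuser",
    "RDS_PASSWORD": "change-me",
    "RDS_DATABASE": "testdb",
    "S3_BUCKET": "my-test-archive-bucket",
    "ARCHIVE_TABLE": "transactions",
})

from lambda_function import lambda_handler

lambda_handler({}, None)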
Environment Variables
Set these environment variables in your AWS Lambda function configuration:
- RDS_HOST
- RDS_USER
- RDS_PASSWORD
- RDS_DATABASE
- S3_BUCKET
- ARCHIVE_TABLE
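These can be set in the console or scripted. A minimal boto3 sketch with placeholder values (the function name and values are assumptions; substitute your own, and consider AWS Secrets Manager rather than a plain-text RDS_PASSWORD):

import boto3

lambda_client = boto3.client("lambda")

# Placeholder values; prefer AWS Secrets Manager for the password.
lambda_client.update_function_configuration(
    FunctionName="rds-archival-function",   # hypothetical function name
    Environment={
        "Variables": {
            "RDS_HOST": "mydb.abc123.us-east-1.rds.amazonaws.com",
            "RDS_USER": "admin",
            "RDS_PASSWORD": "change-me",
            "RDS_DATABASE": "mydatabase",
            "S3_BUCKET": "my-data-archive-bucket",
            "ARCHIVE_TABLE": "transactions",
        }
    },
)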
Step 4: Grant Permissions to Lambda
- Attach an AWS IAM role to AWS Lambda with:
- AmazonS3FullAccess or bucket-specific permissions.
- Access to Amazon RDS credentials, for example via AWSSecretsManagerReadWrite if the credentials are stored in AWS Secrets Manager, or an Amazon RDS-specific policy.
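As a least-privilege alternative to AmazonS3FullAccess, you can attach an inline policy scoped to the archive bucket. A sketch (the role and bucket names are assumptions):

import json
import boto3

iam = boto3.client("iam")

# Allow writes only to the archive bucket, nothing else.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::my-data-archive-bucket/*",
    }],
}

iam.put_role_policy(
    RoleName="lambda-archival-role",        # hypothetical role name
    PolicyName="ArchiveBucketWriteAccess",
    PolicyDocument=json.dumps(policy),
)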
Step 5: Schedule AWS Lambda with Amazon CloudWatch Events
Use Amazon CloudWatch Events to trigger AWS Lambda on a regular schedule:
- Example: Run daily at midnight using a cron expression:
cron(0 0 * * ? *)
Best Practices
- Data Filtering: Ensure your archival query selects only necessary rows to minimize overhead.
- Encryption: Use Amazon S3 bucket encryption for data at rest and SSL/TLS for data in transit.
- Monitoring: Configure Amazon CloudWatch to log AWS Lambda execution details and errors (see the logging sketch after this list).
- Testing: Run the AWS Lambda function in a test environment to validate results before deploying.
- Scalability: Use Amazon RDS read replicas to run archival queries if your workload increases.
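For the monitoring point above, Lambda's Python runtime forwards output from the standard logging module (and print) to Amazon CloudWatch Logs automatically. A minimal logging sketch for the handler:

import logging

# Lambda ships this output to the function's CloudWatch Logs log group.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    logger.info("Archival run started")
    try:
        archived_rows = 0  # placeholder for the real archival logic
        logger.info("Archival run finished; %d rows archived", archived_rows)
    except Exception:
        logger.exception("Archival run failed")
        raise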
Conclusion
Following the steps and best practices outlined in this blog, you can build a robust, scalable, and secure data archival pipeline.
Drop a query if you have any questions regarding Data Archival, and we will get back to you quickly.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft's Global Top 100 and 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries and continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. How do I determine which data to archive from my Amazon RDS database?
ANS: – You can define a specific criterion for identifying archival data, such as a date column (e.g., records older than a certain date) or a custom flag column (archive_flag). SQL queries with these filters ensure only relevant data is selected for archival.
2. How can I ensure the security of archived data in Amazon S3?
ANS: – To secure data in Amazon S3:
- Enable server-side encryption (e.g., AES-256 or AWS KMS keys).
- Use bucket policies and AWS IAM roles to restrict access.
- Enable Amazon S3 versioning for data protection against accidental overwrites or deletions.
- Use Amazon S3 Block Public Access settings to prevent unauthorized access.
WRITTEN BY Deepak Kumar Manjhi