Automate File Conversion with AWS Batch and S3

Overview

AWS Batch is a fully managed service provided by Amazon Web Services (AWS) that enables developers to run batch computing workloads in the cloud.

It simplifies provisioning and managing the infrastructure required to execute batch jobs, allowing users to focus on their applications rather than infrastructure management.

Key Functionalities of AWS Batch

Job Scheduling and Orchestration: AWS Batch provides a robust job scheduling and orchestration system that allows you to define dependencies and priorities for batch jobs.
Scalable Compute Resources: With AWS Batch, you can easily scale your compute resources up or down based on the demand of your batch jobs.
Docker Container Support: AWS Batch integrates seamlessly with Docker containers, enabling you to package your batch job applications and their dependencies.
Cost Optimization: AWS Batch helps optimize costs by allowing you to define and manage the allocation of compute resources based on specific requirements and workload patterns.

Use Cases for AWS Batch

Data Processing and ETL (Extract, Transform, Load): AWS Batch is well-suited for large-scale data processing tasks like ETL pipelines.
Scientific and Research Computing: Researchers and scientists often need to perform computationally intensive simulations or data analysis.
Media Processing and Encoding: Media companies can leverage AWS Batch to process and encode large volumes of media files.
Financial Analytics: Financial institutions can benefit from AWS Batch for running financial analytics and risk modeling computations.

Example

Suppose you work for a publishing company that needs to convert many HTML documents into PDF format for digital distribution. Performing this conversion manually can be time-consuming and tedious.

Here’s how you can leverage AWS Batch to automate the HTML to PDF conversion process:

Gather HTML files and store them in an Amazon S3 bucket.
Define a Docker-based job for HTML to PDF conversion in AWS Batch.
Configure a compute environment for job execution, sized to the workload.
Submit the job to AWS Batch for automatic scheduling and resource allocation.
AWS Batch executes the job, converting HTML files to PDF in parallel.
Monitor job progress, troubleshoot with logging, and set up notifications.
Store the converted PDF files in the desired output location for distribution or archiving.

By utilizing AWS Batch for HTML to PDF conversion, you benefit from the managed infrastructure, scalability, and automation the service provides. It lets you focus on the content and conversion logic rather than the underlying infrastructure management.

Configure the Conversion Code

The code used in this walkthrough is available in the GitHub repository below. Clone or fork the repository to follow along.

https://github.com/heistprofessor/aws-batch.git

Step-by-Step Guide

Step 1: Uploading Source HTML Files to S3:

First, upload the HTML files you want to convert to PDF to an Amazon S3 bucket. If you have not created a bucket yet, navigate to the Amazon S3 service in the AWS Management Console and create a new one.

Upload your HTML files to the S3 bucket, ensuring each file has a unique key or name. Note down the bucket name and the keys/names of the HTML files, as we will need them later.
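
For more than a handful of files, the upload can be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are already configured. The function names, the local directory, and the optional key prefix are illustrative and not part of the repository. Keys are derived from each file's relative path so that every object gets a unique key.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix=""):
    """Pair each .html file under local_dir with a unique S3 key."""
    root = Path(local_dir)
    plan = []
    for path in sorted(root.rglob("*.html")):
        # The relative path (with forward slashes) becomes the key, so
        # files with the same name in different folders do not collide.
        plan.append((path, prefix + path.relative_to(root).as_posix()))
    return plan


def upload_html_files(s3_client, local_dir, bucket, prefix=""):
    """Upload every planned file and return the keys that were written."""
    keys = []
    for path, key in plan_uploads(local_dir, prefix):
        s3_client.upload_file(
            str(path), bucket, key, ExtraArgs={"ContentType": "text/html"}
        )
        keys.append(key)
    return keys
```

With `s3 = boto3.client("s3")`, a call such as `upload_html_files(s3, "site", "<src-bucket-name>")` would upload every `.html` file under a local `site` directory and return the keys to note down for the later steps.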

Step 2: Building the Docker image.

Fork the repository and edit the Python file named ‘app.py’ to set the following parameters:

AWS access key
AWS Secret access key
S3 Source bucket
S3 Destination bucket
S3 Source key
S3 Destination key

Now, in a terminal opened in the repository directory (the one containing the Dockerfile), run this command to build the Docker image, which we will later push to Amazon ECR.

docker build -t html-to-pdf .

Step 3: Tag the Docker image and push it to the Amazon ECR repository.

To tag and push a Docker image to an AWS ECR repository:

Tag the Docker image with the ECR repository URI:

docker tag <image-id> <aws-account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>

Authenticate Docker with the ECR registry:

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws-account-id>.dkr.ecr.<region>.amazonaws.com

Push the tagged Docker image to the ECR repository:

docker push <aws-account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>

Ensure you replace <image-id>, <aws-account-id>, <region>, <repository-name>, and <tag> with the appropriate values for your setup.
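
The registry host and image URI in these commands follow a fixed pattern, so it can help to assemble them once and reuse them for the tag, login, and push steps. A small sketch follows; the function names are hypothetical, and the account ID, region, and repository in the usage note are placeholders.

```python
import re


def ecr_registry(account_id, region):
    """Return the registry host used by docker login."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("an AWS account ID is a 12-digit number")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Return the full image URI used by docker tag and docker push."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"
```

For instance, `ecr_image_uri("123456789012", "us-east-1", "html-to-pdf")` yields `123456789012.dkr.ecr.us-east-1.amazonaws.com/html-to-pdf:latest`.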

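Once pushed, the image appears in the ECR repository in the AWS Management Console. Copy its image URI, as it is needed later for the job definition.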

Step 4: In the AWS Management Console, navigate to AWS Batch.

Step 5: Configure AWS Batch Environment:

Next, we must set up an AWS Batch environment to run our conversion job. Follow these steps:

Configure AWS Batch compute environment: Set up desired compute resources.
Define AWS Batch job queue: Create a queue for conversion job requests.
Create AWS Batch job definition: Specify container image, command, and parameters.

Start with the compute environment: fill in the required details in the Compute environments section, as listed below, and click ‘Create compute environments’.

Environment configuration – Fargate

Name – html-to-pdf

Service role – AWSServiceRoleForBatch (Default role)

Maximum vCPUs – 2

Select appropriate VPC, subnets, and security group.

Review the details and click Create.
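
The same settings can also be expressed as an API request. The sketch below builds the arguments for the boto3 `create_compute_environment` call. It is only an illustration of how the console fields map to request fields; the helper name is hypothetical, and the subnet and security group IDs must come from your own VPC.

```python
def fargate_compute_environment(name, subnet_ids, security_group_ids, max_vcpus=2):
    """Build keyword arguments for batch.create_compute_environment."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "FARGATE",
            "maxvCpus": max_vcpus,
            "subnets": list(subnet_ids),
            "securityGroupIds": list(security_group_ids),
        },
        # serviceRole is left out on purpose: AWS Batch then falls back
        # to the AWSServiceRoleForBatch service-linked role.
    }
```

With `batch = boto3.client("batch")`, the request could be sent as `batch.create_compute_environment(**fargate_compute_environment("html-to-pdf", ["<subnet-id>"], ["<security-group-id>"]))`.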

5. Next, navigate to Job queues in the left pane and click Create. Select Fargate as the orchestration type, provide a name, set the priority to 100, and select the previously created compute environment. Click Create job queue.
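
As a rough programmatic equivalent, the queue settings map onto the boto3 `create_job_queue` call as sketched below. The helper name is hypothetical. When several queues share a compute environment, the queue with the higher priority value is evaluated first, which is why the value matters even with a single queue defined.

```python
def job_queue_request(name, compute_environment, priority=100):
    """Build keyword arguments for batch.create_job_queue."""
    return {
        "jobQueueName": name,
        "state": "ENABLED",
        "priority": priority,
        # Compute environments are tried in ascending "order"; this
        # walkthrough attaches only one.
        "computeEnvironmentOrder": [
            {"order": 1, "computeEnvironment": compute_environment},
        ],
    }
```

With a Batch client, this could be submitted as `batch.create_job_queue(**job_queue_request("html-to-pdf-queue", "html-to-pdf"))`, where the queue name is a placeholder.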

6. Next, navigate to Job definitions and click Create. Choose Fargate as the orchestration type, provide a name, enable Assign public IP, choose the execution role, and click Next.

On the next page, paste the image URI copied earlier into the Image URI field. In the command field, enter the command below in JSON form and click Next.

Command syntax:

["python","html_to_pdf.py","--source-bucket","<src-bucket-name>","--source-key","index.html","--destination-bucket","<dst-bucket-name>","--destination-key","index.pdf"]
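
Everything after the script name in this array reaches the script as command-line arguments. The repository's script is not reproduced here; the following is only a sketch of how such arguments could be parsed with `argparse`. The flag names are assumed to match the command, and `--destination-bucket` in particular is an assumption made by symmetry with the other three flags, so check it against the script you actually build into the image.

```python
import argparse


def parse_args(argv=None):
    """Parse the four S3 locations passed in the job's command array."""
    parser = argparse.ArgumentParser(
        description="Convert one HTML object in S3 to a PDF object in S3"
    )
    parser.add_argument("--source-bucket", required=True)
    parser.add_argument("--source-key", required=True)
    parser.add_argument("--destination-bucket", required=True)  # assumed flag name
    parser.add_argument("--destination-key", required=True)
    # argv=None makes argparse read sys.argv[1:], as it would in the container.
    return parser.parse_args(argv)
```

Because each flag is required, a job whose command array omits one of them fails immediately with a usage error in the job's log rather than partway through the conversion.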

In the logging section, select AWS logs and click Next.

Review the details and create the job definition.
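
For reference, the console choices in this step correspond roughly to the following boto3 `register_job_definition` request. This is a sketch under assumptions: the helper name is hypothetical, and the default vCPU and memory values are not taken from this guide. They are simply one of the size combinations Fargate supports.

```python
def fargate_job_definition(name, image_uri, command, execution_role_arn,
                           vcpu="0.25", memory_mib="512"):
    """Build keyword arguments for batch.register_job_definition."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "platformCapabilities": ["FARGATE"],
        "containerProperties": {
            "image": image_uri,
            "command": list(command),
            "executionRoleArn": execution_role_arn,
            # Resource values are strings; vCPU and memory must form a
            # combination that Fargate supports.
            "resourceRequirements": [
                {"type": "VCPU", "value": vcpu},
                {"type": "MEMORY", "value": memory_mib},
            ],
            # Matches the "assign public IP" option in the console.
            "networkConfiguration": {"assignPublicIp": "ENABLED"},
            # Matches the AWS logs choice: output goes to CloudWatch Logs.
            "logConfiguration": {"logDriver": "awslogs"},
        },
    }
```

The `command` argument is the same list that is pasted into the console as JSON, and the result could be registered with `batch.register_job_definition(**request)` on a Batch client.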

7. Navigate to Jobs and submit a new job. Provide a name, select the job definition and job queue, and click Next. Check the vCPU and memory values and click Next. Review the details and create the job.
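
Submitting the job and waiting for the result can also be scripted. The sketch below assumes a boto3 Batch client is passed in; the function names and the polling interval are illustrative. It submits the job with `submit_job` and then polls `describe_jobs` until the job reaches one of the two final states, SUCCEEDED or FAILED.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(batch_client, job_id, poll_seconds=15, sleep=time.sleep):
    """Poll describe_jobs until the job reaches SUCCEEDED or FAILED."""
    while True:
        response = batch_client.describe_jobs(jobs=[job_id])
        status = response["jobs"][0]["status"]
        if status in TERMINAL_STATES:
            return status
        # Any other status (for example RUNNABLE or RUNNING) means the
        # job is still in progress.
        sleep(poll_seconds)


def submit_and_wait(batch_client, job_name, job_queue, job_definition, **wait_kwargs):
    """Submit one job and block until it finishes; returns the final status."""
    submitted = batch_client.submit_job(
        jobName=job_name, jobQueue=job_queue, jobDefinition=job_definition
    )
    return wait_for_job(batch_client, submitted["jobId"], **wait_kwargs)
```

With `batch = boto3.client("batch")`, calling `submit_and_wait(batch, "convert-index", "<job-queue-name>", "<job-definition-name>")` returns the final status once the job is done. The `sleep` parameter exists only so the wait can be replaced in tests.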

Once the job completes, its status changes to Succeeded. The source bucket still contains the index.html file, and the destination bucket now contains the index.pdf file, uploaded after conversion.
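
Checking the destination bucket can be done from code as well. A minimal sketch, assuming a boto3 S3 client is passed in and using a hypothetical helper name, lists objects under the expected key and looks for an exact match:

```python
def object_exists(s3_client, bucket, key):
    """Return True if the bucket holds an object with exactly this key."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key)
    # "Contents" is absent from the response when nothing matches the prefix.
    return any(obj["Key"] == key for obj in response.get("Contents", []))
```

For example, `object_exists(boto3.client("s3"), "<dst-bucket-name>", "index.pdf")` should return `True` after a successful run.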

Conclusion

AWS Batch provides a streamlined way to automate batch computing workloads such as HTML to PDF conversion. It takes care of job scheduling, resource allocation, and infrastructure management, and its monitoring and logging features give visibility into running jobs and help with troubleshooting. Organizations that use AWS Batch can streamline their workflows, save time, and enhance productivity.

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. How does AWS Batch differ from other compute services provided by AWS?

ANS: – AWS Batch is specifically designed for batch computing workloads, focusing on efficient job scheduling, resource allocation, and scalability.

2. What are the key benefits of using AWS Batch?

ANS: – Key benefits of using AWS Batch include simplified infrastructure management, automatic job scheduling and resource allocation, and the scalability to handle varying workloads.

3. How does AWS Batch handle job scheduling and resource allocation?

ANS: – AWS Batch provides a job scheduling and orchestration system that allows you to define dependencies and priorities for batch jobs.

4. Can I customize the compute environment in AWS Batch?

ANS: – Yes, you can customize the compute environment in AWS Batch. You can define compute resources based on your specific requirements.