Customize Language Translation with Machine Learning Tool: Amazon Translate - Part 2

TABLE OF CONTENTS

1. Overview

2. Translating the SRT Files in Different Languages

3. Setting up a Trigger on S3

4. Translating the SRT and Storing the files in S3

Overview

Streaming video or audio content is a very effective way to share information, entertain, and engage users. Many organizations today have an extensive collection of video or audio content with captions and subtitles. Providing translated captions and subtitles in multiple languages makes this content available to more viewers. This blog shows how to use Amazon Translate to create an automated flow that translates captions and subtitles without losing context.

Captions and subtitles give people with hearing impairment access to video or audio content, provide flexibility for users in noisy or quiet environments, and help support non-native speakers. Captions and subtitles are usually stored in SRT (.srt) or WebVTT (.vtt) format. SRT stands for SubRip Subtitle and is the most common file format for subtitles and captions. WebVTT stands for Web Video Text Tracks and is becoming a popular format for the same purpose. In this blog, we focus on translating SRT files into different languages.

Translating the SRT Files in Different Languages

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Neural machine translation is a form of automated language translation that uses machine learning models to deliver more accurate and natural-sounding translations than standard rule-based translation algorithms.

With Amazon Translate, you can localize content such as websites and apps for users in different regions, easily translate large volumes of text for analysis, and enable communication between users who speak different languages.

In this article, we translate the caption text stored in an SRT file into different languages. We use an S3 trigger so that the translation is automated from start to end. Below is an overview of the steps we will accomplish in this article.

Create a Lambda role with access to S3, CloudWatch, and Amazon Translate
Create an S3 bucket that serves as the input and output bucket for the Amazon Translate workflow
Create a Lambda function with the Python runtime that extracts the caption text from a WebVTT or SRT file and creates a delimited text file, using an HTML tag as the delimiter
The delimited text is the caption text with the timestamps removed, joined into regular text
Translate the delimited file into multiple languages
After translation, rebuild the SRT files from the translated delimited file by adding the timestamps back

Setting up a Trigger on S3

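Click on the ‘Add Trigger’ option on the Lambda function, select ‘S3’ as the source, and select ‘PUT’ as the event type. The prefix is the folder to watch and the suffix is the file type. For this demo we consider only .srt files, so the Lambda is triggered when an .srt file is uploaded to the “input” folder. Because the translated files are written to the “output” folder of the same bucket, the prefix filter also keeps the function from triggering itself.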
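To make the prefix and suffix filter concrete, here is a minimal sketch of how a handler could pull the bucket and key out of an S3 event and apply the same filter in code. The helper name extract_s3_objects and the sample event are illustrative assumptions, not part of the original Lambda; the event shape follows the fields the Lambda code in this post reads.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event, prefix="input/", suffix=".srt"):
    """Return (bucket, key) pairs from an S3 event that match the prefix and suffix.

    Hypothetical helper for illustration; the trigger already filters events,
    so this is only a second line of defence inside the function.
    """
    matches = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 notifications (a space arrives as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix) and key.lower().endswith(suffix):
            matches.append((bucket, key))
    return matches


# Hand-written sample event with one matching and one non-matching object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "test-bucket-translate-demo"},
                "object": {"key": "input/my+video.srt"}}},
        {"s3": {"bucket": {"name": "test-bucket-translate-demo"},
                "object": {"key": "output/my+video_hi.srt"}}},
    ]
}

print(extract_s3_objects(sample_event))
# [('test-bucket-translate-demo', 'input/my video.srt')]
```

Only the file under “input/” is returned, so a translated file written to “output/” would be ignored even if it reached the function.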

Translating the SRT and Storing the files in S3

Increase the Lambda timeout under Configuration settings. By default, it is set to 3 seconds
Import the required libraries, such as boto3 and webvtt
The code reads the event and fetches the bucket name and file name from it
We use the “get_object” API to read the object. You can also use the “download_file” API to download the file, depending on your requirement
We decode the bytes returned by S3 to get the actual file content
Here, we assume the SRT file is in English. If you want to detect the language automatically, you can call the Amazon Comprehend “detect_dominant_language” API
Next, we translate the English SRT file into other languages such as Hindi, Marathi, and Tamil
Amazon Translate supports 75 languages, so you can change the target language in the code as per your requirements
Amazon Translate does not accept SRT files directly, so we first convert the SRT file into text without the timestamps
We replace the timestamps with the <span> tag and then call the Translate API. Translate does not translate the tag
We call the “translate_text” API to translate the text, passing the source language code and the target language code
After the text is translated, we replace the tag with the timestamps again
Then we create a text file and upload it to S3. We add the language code to the file name so the files are easy to tell apart
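The steps above mention Hindi, Marathi, and Tamil, while the Lambda code in this post translates into a single language. A minimal sketch of how the translation step could loop over several target languages is shown below. The function names translate_to_languages and output_key are hypothetical, and the FakeTranslate class is only a stand-in so the sketch runs without AWS credentials; in the Lambda you would pass boto3.client('translate') instead.

```python
def translate_to_languages(delimited_text, target_languages, translate_client,
                           source_language="en"):
    """Translate one delimited caption string into several languages.

    translate_client is expected to expose translate_text with the same
    keyword arguments as the boto3 Translate client.
    """
    results = {}
    for code in target_languages:
        response = translate_client.translate_text(
            Text=delimited_text,
            SourceLanguageCode=source_language,
            TargetLanguageCode=code,
        )
        results[code] = response["TranslatedText"]
    return results


def output_key(input_key, language_code):
    """Build the output object key, e.g. input/demo.srt -> output/demo_hi.srt."""
    base_name = input_key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"output/{base_name}_{language_code}.srt"


class FakeTranslate:
    """Stand-in for the real client (assumption for this sketch only)."""

    def translate_text(self, Text, SourceLanguageCode, TargetLanguageCode):
        return {"TranslatedText": f"[{TargetLanguageCode}] {Text}"}


translations = translate_to_languages(
    "Hello<span>World", ["hi", "mr", "ta"], FakeTranslate()
)
for code, text in translations.items():
    print(output_key("input/demo.srt", code), "->", text)
```

Each translated string would then go through the same timestamp-restoring step before being uploaded under its own key.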

Lambda Code

This code is invoked by the S3 event and fetches the file data. It then calls the functions from the “srtCaptions” file, which remove the timestamps from the file and convert it into plain text for translation. Finally, it translates the text into the target language and adds the timestamps back to the translated text.

import boto3
from urllib.parse import unquote_plus
from srtCaptions import *

s3 = boto3.client('s3')
translate = boto3.client('translate')

def lambda_handler(event, context):
    try:
        print(event)
        # Bucket and key of the uploaded file (keys are URL-encoded in S3 events)
        fileName = unquote_plus(event['Records'][0]['s3']['object']['key'])
        fileBucket = event['Records'][0]['s3']['bucket']['name']
        # Read the SRT file and decode it to text
        resp = s3.get_object(Bucket=fileBucket, Key=fileName)
        subtitles = resp['Body'].read().decode("utf-8")
        # If the first cue is numbered 0, renumber it to 1 before parsing.
        # Replacing the first '0' unconditionally would corrupt the first timestamp.
        rep = subtitles
        if subtitles.split('\n', 1)[0].strip() == '0':
            rep = subtitles.replace('0', '1', 1)
        # Strip the timestamps and join the captions with the <span> marker
        srt = srtToCaptions(rep)
        delimitedFile = ConvertToDemilitedFiles(srt)
        TargetLanguage = 'hi'
        translatedData = translate.translate_text(Text=delimitedFile, SourceLanguageCode='en', TargetLanguageCode=TargetLanguage)
        # Split the translation on the marker and add the timestamps back
        translatedCaptionsList = DelimitedToWebCaptions(srt, translatedData['TranslatedText'], "<span>", 15)
        captionsSRT = captionsToSRT(translatedCaptionsList)
        # Name the output after the input file plus the language code
        filename = fileName.split('/')[-1].rsplit('.', 1)[0] + '_' + TargetLanguage + '.srt'
        with open(f'/tmp/{filename}', "w", encoding="utf-8") as file:
            file.write(captionsSRT)
        s3.upload_file(
            Filename=f'/tmp/{filename}',
            Bucket="test-bucket-translate-demo",
            Key=f'output/{filename}'
        )
        return {
            "statusCode": 200,
            "body": captionsSRT
        }

    except Exception as e:
        print(e)
        return {
            "statusCode": 400,
            "body": 'Error in Execution !!'
        }
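The handler above sends the whole delimited string to translate_text in a single call, which can fail for long subtitle files because the API has a per-call size limit (see the FAQs). One possible way around this, sketched here under that assumption, is to group captions into chunks that stay below a byte budget and translate each chunk separately. The function name chunk_captions is hypothetical, and max_bytes is left as a parameter so you can set it to the limit that applies to your account.

```python
def chunk_captions(captions, max_bytes, marker="<span>"):
    """Group caption strings into marker-joined chunks of at most max_bytes.

    Sizes are measured in UTF-8 bytes, not characters, because scripts such
    as Devanagari use several bytes per character.
    """
    chunks = []
    current = []
    current_size = 0
    marker_size = len(marker.encode("utf-8"))
    for caption in captions:
        size = len(caption.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single caption exceeds the size limit")
        # A marker is only needed when the chunk already holds a caption.
        extra = size + (marker_size if current else 0)
        if current and current_size + extra > max_bytes:
            chunks.append(marker.join(current))
            current, current_size = [caption], size
        else:
            current.append(caption)
            current_size += extra
    if current:
        chunks.append(marker.join(current))
    return chunks


# Tiny budget chosen only to make the splitting visible.
print(chunk_captions(["aaaa", "bbbb", "cccc"], max_bytes=14))
# ['aaaa<span>bbbb', 'cccc']
```

Because every chunk starts and ends on a caption boundary, the translated chunks can be joined with the same marker and passed to DelimitedToWebCaptions unchanged.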

Code in srtCaptions.py file

This file contains the code that removes the timestamps from the SRT file and converts it into text that we can send for translation. After translation, it adds the timestamps back to the translated text so that the result can be stored in S3 as an SRT file.

from tempfile import NamedTemporaryFile
import html
import logging
import os
import webvtt

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def srtToCaptions(srt):
    # Parse SRT text into a list of {"start", "end", "caption"} dictionaries
    captions = []
    # webvtt reads from a file path, so write the SRT text to a temporary file first
    f = NamedTemporaryFile(mode='w+', delete=False, encoding='utf-8')
    f.write(srt)
    f.close()
    try:
        for srtcaption in webvtt.from_srt(f.name):
            caption = {}
            logger.debug(srtcaption)
            caption["start"] = formatTimeSRTtoSeconds(srtcaption.start)
            caption["end"] = formatTimeSRTtoSeconds(srtcaption.end)
            # Join multi-line captions so that no line is dropped
            caption["caption"] = " ".join(srtcaption.lines)
            logger.debug("Caption Object:{}".format(caption))
            captions.append(caption)
    finally:
        # Clean up the temporary file
        os.remove(f.name)
    return captions

def formatTimeSRT(timeSeconds):
    # Convert seconds to the SRT timestamp format HH:MM:SS,mmm.
    # Work in whole milliseconds to avoid floating-point truncation errors.
    totalMillis = int(round(timeSeconds * 1000))
    hours, remainder = divmod(totalMillis, 60 * 60 * 1000)
    minutes, remainder = divmod(remainder, 60 * 1000)
    seconds, millis = divmod(remainder, 1000)
    return str(hours).zfill(2) + ':' + str(minutes).zfill(2) + ':' + str(seconds).zfill(2) + ',' + str(millis).zfill(3)

def formatTimeSRTtoSeconds(timeHMSf):
    # Convert an HH:MM:SS.mmm (or HH:MM:SS,mmm) timestamp to seconds, returned as a string
    hours, minutes, seconds = (timeHMSf.split(":"))[-3:]
    hours = int(hours)
    minutes = int(minutes)
    seconds = float(seconds.replace(',', '.'))
    timeSeconds = float(3600 * hours + 60 * minutes + seconds)
    return str(timeSeconds)

def captionsToSRT(captions):
    # Build SRT text from the caption dictionaries; SRT cue numbers start at 1
    srt = ''
    index = 1
    for caption in captions:
        srt += str(index) + '\n'
        srt += formatTimeSRT(float(caption["start"])) + ' --> ' + formatTimeSRT(float(caption["end"])) + '\n'
        srt += caption["caption"] + '\n\n'
        index += 1
    return srt.rstrip()

def ConvertToDemilitedFiles(inputCaptions):
    marker = "<span>"
    # Convert captions to text with marker between caption lines
    inputEntries = map(lambda c: c["caption"], inputCaptions)
    inputDelimited = marker.join(inputEntries)
    logger.debug(inputDelimited)
    return inputDelimited

def DelimitedToWebCaptions(sourceWebCaptions, delimitedCaptions, delimiter, maxCaptionLineLength):
    # maxCaptionLineLength is accepted for compatibility but is not used here
    delimitedCaptions = html.unescape(delimitedCaptions)
    entries = delimitedCaptions.split(delimiter)
    outputWebCaptions = []
    for i, c in enumerate(sourceWebCaptions):
        caption = {}
        caption["start"] = c["start"]
        caption["end"] = c["end"]
        # Fall back to an empty caption if the translation lost a marker
        caption["caption"] = entries[i].strip() if i < len(entries) else ''
        caption["sourceCaption"] = c["caption"]
        outputWebCaptions.append(caption)
    return outputWebCaptions
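To see the strip-and-restore idea in isolation, the following sketch runs the same round trip on a small in-memory SRT string using only the standard library. It is a simplified illustration, not the srtCaptions module: the regex-based parse_srt, to_delimited, and rebuild_srt helpers are assumptions made for this example, and a hand-written string stands in for the Amazon Translate response.

```python
import re

MARKER = "<span>"

# One cue: number, "start --> end" line, then text up to a blank line or the end.
CUE_PATTERN = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?=\n\s*\n|\Z)",
    re.DOTALL,
)

SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,000
Hello and welcome.

2
00:00:04,500 --> 00:00:08,000
This video shows
how subtitles work.
"""


def parse_srt(srt_text):
    """Split SRT text into cues, keeping timestamps apart from the caption text."""
    normalized = srt_text.replace("\r\n", "\n").strip()
    cues = []
    for _, start, end, text in CUE_PATTERN.findall(normalized):
        cues.append({"start": start, "end": end,
                     "caption": " ".join(text.splitlines())})
    return cues


def to_delimited(cues):
    """Join the caption texts with the marker; timestamps are left out."""
    return MARKER.join(cue["caption"] for cue in cues)


def rebuild_srt(cues, translated_delimited):
    """Split the translated text on the marker and re-attach the timestamps."""
    entries = translated_delimited.split(MARKER)
    if len(entries) != len(cues):
        raise ValueError("translated text does not contain one entry per cue")
    blocks = []
    for number, (cue, text) in enumerate(zip(cues, entries), start=1):
        blocks.append(f"{number}\n{cue['start']} --> {cue['end']}\n{text.strip()}")
    return "\n\n".join(blocks)


cues = parse_srt(SAMPLE_SRT)
print(to_delimited(cues))
# Hello and welcome.<span>This video shows how subtitles work.

# Stand-in for the translated text returned by the service.
fake_translation = "Namaste aur swagat hai.<span>Yah video dikhata hai ki subtitles kaise kaam karte hain."
print(rebuild_srt(cues, fake_translation))
```

The timestamps never leave the function, so only the caption text is sent for translation and the original timing is preserved exactly.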

Conclusion

When we upload an SRT file to the “input” folder of our S3 bucket, the Lambda is triggered, and once it finishes, the translated SRT file appears in the “output” folder of the same bucket. This SRT file can then be used for further processing as per the business requirements.

Refer to ‘Translate Text to Different Languages using Amazon Translate- Part 1’ for more information about Amazon Translate.

About CloudThat

CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, and Training Partner, helping people develop knowledge of the cloud and helping their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.

If you have any queries about Amazon SageMaker, Natural Language Processing, Hugging Face, or anything related to AWS services, feel free to drop in a comment. We will get back to you quickly. Visit our Consulting Page for more updates on our customer offerings, expertise, and cloud services.

FAQs

What are the different inputs which Amazon Translate supports?

Ans. Amazon Translate supports plain text input in UTF-8 format.

What are the size limits on the Translate API?

Ans. Real-time Amazon Translate API calls (translate_text) are limited to 10,000 bytes of text per call. The asynchronous Batch Translation service accepts a batch of up to 5 GB per API call.

Does Amazon Translate provide automatic source language detection?

Ans. Yes. If the source language is unknown, you can specify “auto” as the source language code, and Amazon Translate detects the source language using Amazon Comprehend behind the scenes.

Are requests where the source language and the target language are the same charged?

Ans. No. Requests are not charged if the source language is the same as the target language.

WRITTEN BY Sanket Gaikwad

Sanket is a Cloud-Native Backend Developer at CloudThat, specializing in serverless development, backend systems, and modern frontend frameworks such as React. His expertise spans cloud-native architectures, Python, Dynamics 365, and AI/ML solution design, enabling him to play a key role in building scalable, intelligent applications. Combining strong backend proficiency with a passion for cloud technologies and automation, Sanket delivers robust, enterprise-grade solutions. Outside of work, he enjoys playing cricket and exploring new places through travel.