{"id":13128,"date":"2022-07-11T09:58:23","date_gmt":"2022-07-11T09:58:23","guid":{"rendered":"https:\/\/blog.cloudthat.com\/?p=13128"},"modified":"2024-06-25T10:57:32","modified_gmt":"2024-06-25T10:57:32","slug":"customize-language-translation-with-machine-learning-tool-amazon-translate","status":"publish","type":"blog","link":"https:\/\/www.cloudthat.com\/resources\/blog\/customize-language-translation-with-machine-learning-tool-amazon-translate-part-2","title":{"rendered":"Customize Language Translation with Machine Learning Tool: Amazon Translate- Part 2"},"content":{"rendered":"<table style=\"height: 231px;\" border=\"0\" width=\"411\">\n<tbody>\n<tr>\n<td>\n<h2><span style=\"color: #000080;\"><strong>TABLE OF CONTENT<\/strong><\/span><\/h2>\n<\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#overview\">1. Overview<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#translatesrtfiles\">2. Translating the SRT Files in Different Languages<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#settinguptraiggerons3\">3. Setting up a Trigger on S3<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#translatingsrt\">4. Translating the SRT and Storing the files in S3<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#lambdacode\">5. Lambda Code<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#conclusion\">6. Conclusion<\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#aboutcloudthat\">7. About CloudThat <\/a><\/td>\n<\/tr>\n<tr>\n<td><a style=\"margin-left: 20px;\" href=\"#faqs\">8. FAQs<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h2 id=\"overview\"><strong>Overview<\/strong><\/h2>\n<p><span style=\"color: #000000;\">Streaming video or audio content is a\u00a0very\u00a0effective way to\u00a0share information,\u00a0entertain, and engage users. 
Many organizations today have an extensive collection of video or audio content with captions and subtitles. Providing translated captions and subtitles in multiple languages makes this content available to more viewers. This blog shows how to use Amazon Translate to create an automated flow that translates captions and subtitles without losing context.<\/span><\/p>\n<p><span style=\"color: #000000;\">Captions and subtitles give people with hearing impairments access to video or audio content, provide flexibility for users in both noisy and quiet environments, and help support non-native speakers. Captions and subtitles are usually delivered in SRT (.srt) or WebVTT (.vtt) format. SRT stands for SubRip Subtitle and is the most common file format for subtitles and captions. WebVTT stands for Web Video Text Tracks and is becoming a popular format for the same purpose. In this blog, we translate SRT files into different languages.<\/span><\/p>\n<h2 id=\"translatesrtfiles\"><strong>Translating the SRT Files in Different Languages<\/strong><\/h2>\n<p><span style=\"color: #000000;\">Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. 
Neural machine translation is a form of automated language translation that uses machine learning models to deliver more accurate and more natural-sounding translations than standard rule-based translation algorithms.<\/span><\/p>\n<p><span style=\"color: #000000;\">With <strong>Amazon Translate<\/strong>, you can localize content such as websites and apps for diverse users, easily translate large volumes of text for analysis, and efficiently enable communication between users who speak different languages.<\/span><\/p>\n<p><span style=\"color: #000000;\">In this article, we translate the captions stored in an SRT file into different languages. We use an S3 trigger to automate the translation from start to end. Below is an overview of what we will accomplish in this article.<\/span><\/p>\n<ul>\n<li><span style=\"color: #000000;\">Create a Lambda role with access to S3, CloudWatch, and Amazon Translate<\/span><\/li>\n<li><span style=\"color: #000000;\">Create an S3 bucket that serves as both the input and the output bucket<\/span><\/li>\n<li><span style=\"color: #000000;\">Create a Lambda function with the Python runtime that extracts the caption text from a WebVTT or SRT file and creates a delimited text file, using an HTML tag as the delimiter<\/span><\/li>\n<li><span style=\"color: #000000;\">Delimited text here means the caption text with the timestamps removed, joined into regular text<\/span><\/li>\n<li><span style=\"color: #000000;\">Translate the delimited file into multiple languages<\/span><\/li>\n<\/ul>\n<p><span style=\"color: #000000;\">After translation, we rebuild the SRT files from the translated delimited file by adding the timestamps back.<\/span><\/p>\n<p><a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/translatenew.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13131\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/translatenew.png\" 
alt=\"Amazon Translate\" width=\"1003\" height=\"815\" \/><\/a><\/p>\n<h2 id=\"settinguptraiggerons3\"><strong>Setting up a Trigger on S3<\/strong><\/h2>\n<p><span style=\"color: #000000;\">Click the \u2018<strong>Add Trigger<\/strong>\u2019 option on the Lambda function, select \u2018S3\u2019 as the source, and select \u2018<strong>PUT<\/strong>\u2019 as the event type. The prefix is the folder, and the suffix is the file type. We consider only <strong>.srt<\/strong> files for this demo, so our Lambda is triggered when an .srt file is uploaded to the \u201cinput\u201d folder.<\/span><\/p>\n<p><a href=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/translate2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13132\" src=\"https:\/\/d1f7lmxeo98xps.cloudfront.net\/resources\/wp-content\/uploads\/2022\/11\/translate2.png\" alt=\"Amazon Translate\" width=\"1727\" height=\"640\" \/><\/a><\/p>\n<h2 id=\"translatingsrt\"><strong>Translating the SRT and Storing the files in S3<\/strong><\/h2>\n<ul>\n<li><span style=\"color: #000000;\">Increase the Lambda timeout under the Configuration settings. By default, it is set to 3 seconds<\/span><\/li>\n<li><span style=\"color: #000000;\">Import the required libraries, such as boto3 and webvtt<\/span><\/li>\n<li><span style=\"color: #000000;\">The code reads the event and fetches the bucket name and file name from it<\/span><\/li>\n<li><span style=\"color: #000000;\">Then we use the \u201cget_object\u201d API to fetch the object. You can also use the \u201cdownload_file\u201d API to download the file, depending on your requirements<\/span><\/li>\n<li><span style=\"color: #000000;\">Then we decode the returned bytes to get the actual file data<\/span><\/li>\n<li><span style=\"color: #000000;\">Here, we assume the SRT file is in English. 
If you want to detect the source language automatically, you can call the Amazon Comprehend API to detect the language of the text<\/span><\/li>\n<li><span style=\"color: #000000;\">Now, we translate the English SRT file into other languages such as Hindi, Marathi, and Tamil. The sample code below uses Hindi (\u2018hi\u2019) as the target language<\/span><\/li>\n<li><span style=\"color: #000000;\">Amazon Translate supports 75 languages, so you can modify the code as per your requirements<\/span><\/li>\n<li><span style=\"color: #000000;\">Amazon Translate does not support SRT files for translation, so we need to strip the timestamps from the SRT file before translating it<\/span><\/li>\n<li><span style=\"color: #000000;\">We remove the timestamps and join the caption text with a &lt;span&gt; tag as the delimiter, and then call the Translate API. Translate does not translate the &lt;span&gt; tag<\/span><\/li>\n<li><span style=\"color: #000000;\">We call the \u201ctranslate_text\u201d API to translate the text, passing the source language code and the target language code<\/span><\/li>\n<li><span style=\"color: #000000;\">After translation, we split the text on the &lt;span&gt; tag and reattach the timestamps<\/span><\/li>\n<li><span style=\"color: #000000;\">Then we create an SRT file and upload it to S3. We add the language code to the file name for easy access to the files<\/span><\/li>\n<\/ul>\n<h2 id=\"lambdacode\"><strong>Lambda Code<\/strong><\/h2>\n<p><span style=\"color: #000000;\">This code is invoked by the S3 event and fetches the file data. Then we call the functions from the \u201csrtCaptions\u201d file, which remove the timestamps from the file and convert it into plain text for translation. 
Then we translate the text as required and add the timestamps back to the translated text.<\/span><\/p>\n<pre class=\"theme:dark-terminal nums:false nums-toggle:false lang:default decode:true\">import boto3\r\nfrom urllib.parse import unquote_plus\r\nfrom srtCaptions import *\r\n\r\n# Create the clients once, outside the handler, so warm invocations reuse them\r\ns3 = boto3.client('s3')\r\ntranslate = boto3.client('translate')\r\n\r\ndef lambda_handler(event, context):\r\n    try:\r\n        print(event)\r\n        # Object keys in S3 event notifications are URL-encoded\r\n        fileName = unquote_plus(event['Records'][0]['s3']['object']['key'])\r\n        fileBucket = event['Records'][0]['s3']['bucket']['name']\r\n        resp = s3.get_object(Bucket=fileBucket, Key=fileName)\r\n        subtitles = resp['Body'].read().decode(\"utf-8\")\r\n        rep = subtitles.replace('0', '1', 1)\r\n        # Parse the SRT into caption dicts, then join the caption text with &lt;span&gt; markers\r\n        srt = srtToCaptions(rep)\r\n        delimitedFile = ConvertToDemilitedFiles(srt)\r\n        TargetLanguage = 'hi'\r\n        translatedData = translate.translate_text(Text=delimitedFile, SourceLanguageCode='en', TargetLanguageCode=TargetLanguage)\r\n        # Split the translated text on the marker and reattach the original timestamps\r\n        translatedCaptionsList = DelimitedToWebCaptions(srt, translatedData['TranslatedText'], \"&lt;span&gt;\", 15)\r\n        captionsSRT = captionsToSRT(translatedCaptionsList)\r\n        # Assumes the key has the form input\/name.srt; the language code is appended to the name\r\n        filename = fileName.split('\/')[1].split('.')[0] + '_' + TargetLanguage + '.srt'\r\n        # \/tmp is the writable directory in Lambda\r\n        with open(f'\/tmp\/{filename}', \"w\", encoding=\"utf-8\") as file:\r\n            file.write(captionsSRT)\r\n        s3.upload_file(\r\n            Filename=f'\/tmp\/{filename}',\r\n            Bucket=\"test-bucket-translate-demo\",\r\n            Key=f'output\/{filename}'\r\n        )\r\n        return {\r\n            \"statusCode\": 200,\r\n            \"body\": captionsSRT\r\n        }\r\n\r\n    except Exception as e:\r\n        print(e)\r\n        return {\r\n            \"statusCode\": 400,\r\n            \"body\": 'Error in Execution !!'\r\n        }<\/pre>\n<p><span style=\"color: #000000;\"><strong>Code in 
srtCaptions.py file<\/strong><\/span><\/p>\n<p><span style=\"color: #000000;\">This file contains the code which will remove the timestamp from the SRT file, and will convert it into text that we can use for Translation. After Translation, again we will add the timestamp to the translated text and store it in S3.<\/span><\/p>\n<pre class=\"theme:dark-terminal nums:false nums-toggle:false lang:default decode:true \">from tempfile import NamedTemporaryFile\r\nimport math\r\nimport html\r\nimport re\r\nimport webvtt\r\nfrom io import StringIO\r\nimport logging\r\nlogging.basicConfig(level=logging.DEBUG)\r\nlogger = logging.getLogger(__name__)\r\ndef srtToCaptions(srt):\r\n        captions = []\r\n        f = NamedTemporaryFile(mode='w+', delete=False)\r\n        f.write(srt)\r\n        f.close()\r\n        for srtcaption in webvtt.from_srt(f.name):\r\n            caption = {}\r\n            logger.debug(srtcaption)\r\n            caption[\"start\"] = formatTimeSRTtoSeconds(srtcaption.start)\r\n            caption[\"end\"] = formatTimeSRTtoSeconds(srtcaption.end)\r\n            caption[\"caption\"] = srtcaption.lines[0]\r\n            logger.debug(\"Caption Object:{}\".format(caption))\r\n            captions.append(caption)\r\n        return captions\r\n\r\ndef formatTimeSRT(timeSeconds):\r\n        ONE_HOUR = 60 * 60\r\n        ONE_MINUTE = 60\r\n        hours = math.floor(timeSeconds \/ ONE_HOUR)\r\n        remainder = timeSeconds - (hours * ONE_HOUR)\r\n        minutes = math.floor(remainder \/ 60)\r\n        remainder = remainder - (minutes * ONE_MINUTE)\r\n        seconds = math.floor(remainder)\r\n        remainder = remainder - seconds\r\n        millis = remainder\r\n        return str(hours).zfill(2) + ':' + str(minutes).zfill(2) + ':' + str(seconds).zfill(2) + ',' + str(math.floor(millis * 1000)).zfill(3)\r\n\r\ndef formatTimeSRTtoSeconds(timeHMSf):\r\n        hours, minutes, seconds = (timeHMSf.split(\":\"))[-3:]\r\n        hours = int(hours)\r\n        
minutes = int(minutes)\r\n        seconds = float(seconds)\r\n        timeSeconds = float(3600 * hours + 60 * minutes + seconds)\r\n        return str(timeSeconds)\r\n        \r\ndef captionsToSRT(captions):\r\n        srt = ''\r\n        # SRT sequence numbers start at 1\r\n        index = 1\r\n        for caption in captions:\r\n            srt += str(index) + '\\n'\r\n            srt += formatTimeSRT(float(caption[\"start\"])) + ' --&gt; ' + formatTimeSRT(float(caption[\"end\"])) + '\\n'\r\n            srt += caption[\"caption\"] + '\\n\\n'\r\n            index += 1\r\n        return srt.rstrip()\r\n        \r\ndef ConvertToDemilitedFiles(inputCaptions):\r\n        marker = \"&lt;span&gt;\"\r\n        # Convert captions to text with marker between caption lines\r\n        inputEntries = map(lambda c: c[\"caption\"], inputCaptions)\r\n        inputDelimited = marker.join(inputEntries)\r\n        logger.debug(inputDelimited)\r\n        return inputDelimited\r\n \r\ndef DelimitedToWebCaptions(sourceWebCaptions, delimitedCaptions, delimiter, maxCaptionLineLength):\r\n        # Undo any HTML escaping of the marker, then split the text back into captions\r\n        delimitedCaptions = html.unescape(delimitedCaptions)\r\n        entries = delimitedCaptions.split(delimiter)\r\n        outputWebCaptions = []\r\n        # Pair each translated entry with the timing of the source caption at the same position\r\n        for i, c in enumerate(sourceWebCaptions):\r\n            caption = {}\r\n            caption[\"start\"] = c[\"start\"]\r\n            caption[\"end\"] = c[\"end\"]\r\n            caption[\"caption\"] = entries[i]\r\n            caption[\"sourceCaption\"] = c[\"caption\"]\r\n            outputWebCaptions.append(caption)\r\n        return outputWebCaptions<\/pre>\n<h3 id=\"conclusion\"><strong>Conclusion<\/strong><\/h3>\n<p><span style=\"color: #000000;\">When we upload an SRT file to the input folder of our S3 bucket, our Lambda is triggered. Once it finishes, the translated SRT file appears in the output folder of the bucket. 
This SRT file can be used per the business requirements for further processing, depending on the use case.<\/span><\/p>\n<p><span style=\"color: #000000;\">Refer to \u2018<a href=\"https:\/\/blog.cloudthat.com\/translate-text-different-languages-amazon-translate\/?utm_source=blog-website&amp;utm-medium=text-link&amp;utm_campaign=translate-text-different-languages-amazon-translate\/\" target=\"_blank\" rel=\"noopener\"><strong>Translate Text to Different Languages using Amazon Translate- Part 1<\/strong><\/a>\u2019 for more information about Amazon Translate.<\/span><\/p>\n<h3 id=\"aboutcloudthat\"><strong>About CloudThat<\/strong><\/h3>\n<p><a href=\"https:\/\/www.cloudthat.com\/\"><strong>CloudThat<span style=\"color: #000000;\">\u00a0<\/span><\/strong><\/a><span style=\"color: #000000;\">is\u00a0the official AWS Advanced Consulting Partner, Microsoft Gold Partner, and Training partner helping people develop knowledge on the cloud and help their businesses aim for higher goals using best in industry cloud computing practices and expertise. We are on a mission to build\u00a0a robust\u00a0cloud computing ecosystem by disseminating\u00a0knowledge on technological intricacies within the cloud space.\u00a0Our blogs, webinars,\u00a0case studies, and white papers\u00a0enable all the stakeholders in the cloud computing sphere.<\/span><\/p>\n<p><span style=\"color: #000000;\">If you have any queries about Amazon SageMaker, Natural Language Processing, Hugging Face, or anything related to AWS services, feel free to drop in a comment. We will get back to you quickly. 
Visit our <a href=\"https:\/\/www.cloudthat.com\/consulting\/\" target=\"_blank\" rel=\"noopener\"><strong>Consulting Page<\/strong><\/a> for more updates on our customer offerings, expertise, and cloud services.<\/span><\/p>\n<h3 id=\"faqs\"><strong>FAQs<\/strong><\/h3>\n<ol>\n<li><span style=\"text-decoration: underline;\"><strong><span style=\"color: #000000; text-decoration: underline;\">What input formats does Amazon Translate support?<\/span><\/strong><\/span><\/li>\n<\/ol>\n<p><span style=\"color: #000000;\">Ans. Amazon Translate supports plain text input in UTF-8 format.<\/span><\/p>\n<ol start=\"2\">\n<li><span style=\"text-decoration: underline;\"><strong><span style=\"color: #000000; text-decoration: underline;\">What are the size limits on the Translate API?<\/span><\/strong><\/span><\/li>\n<\/ol>\n<p><span style=\"color: #000000;\">Ans. Real-time Amazon Translate API calls are limited to 5,000 bytes per API call. The asynchronous batch translation service of Amazon Translate accepts a batch of up to 5 GB in size per API call.<\/span><\/p>\n<ol start=\"3\">\n<li><span style=\"text-decoration: underline;\"><strong><span style=\"color: #000000; text-decoration: underline;\">Does Amazon Translate provide automatic source language detection?<\/span><\/strong><\/span><\/li>\n<\/ol>\n<p><span style=\"color: #000000;\">Ans. Yes. If the source language is unknown, Amazon Translate detects it automatically, using Amazon Comprehend behind the scenes.<\/span><\/p>\n<ol start=\"4\">\n<li><span style=\"text-decoration: underline;\"><strong><span style=\"color: #000000; text-decoration: underline;\">Are requests charged when the source language and the target language are the same?<\/span><\/strong><\/span><\/li>\n<\/ol>\n<p><span style=\"color: #000000;\">Ans. 
No, Requests are not charged if the source language equals the target language.<\/span><\/p>\n","protected":false},"author":267,"featured_media":13287,"parent":0,"comment_status":"open","ping_status":"open","template":"","blog_category":[4046,3606,3607],"user_email":"sanketg@cloudthat.com","published_by":"324","primary-authors":"","secondary-authors":"","acf":[],"_links":{"self":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog\/13128"}],"collection":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/users\/267"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/comments?post=13128"}],"version-history":[{"count":1,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog\/13128\/revisions"}],"predecessor-version":[{"id":41886,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog\/13128\/revisions\/41886"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/"}],"wp:attachment":[{"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/media?parent=13128"}],"wp:term":[{"taxonomy":"blog_category","embeddable":true,"href":"https:\/\/www.cloudthat.com\/resources\/wp-json\/wp\/v2\/blog_category?post=13128"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}