Analyzing and Interpreting Genomic Data Using Amazon Omics

Introduction

Amazon Omics is a cloud-based suite of tools and services from Amazon Web Services (AWS) that facilitates the processing, storage, and analysis of large-scale genomic data sets. It gives researchers and scientists who work with genomic data a scalable, flexible, and secure platform for data processing and storage.

Researchers can also pair Amazon Omics with AWS visualization tools to view and explore data sets in a user-friendly manner. Amazon QuickSight, for example, is a business intelligence tool that can be used to create and share interactive dashboards and reports.

To ensure data security, Amazon Omics is designed to meet regulatory requirements, such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). AWS also provides compliance resources and services to help researchers and scientists ensure their work complies with regulations and industry standards.

Amazon Omics provides various solutions that can be tailored to meet specific needs. For example, AWS offers Amazon Redshift, a data warehousing solution that can be used to store and manage genomic data sets, and Amazon Aurora, a relational database that is optimized for performance and scalability. Additionally, AWS provides various tools and services for application development, data integration, and data analysis.

One of the primary benefits of Amazon Omics is its scalability. AWS can handle extremely large data sets, which can be accessed through Amazon Elastic Compute Cloud (EC2). This fundamental component enables researchers to easily provision and scale compute resources to handle the large-scale data processing and analysis required in genomics and other “omic” research.

Amazon Simple Storage Service (S3): By using S3, bioinformatics researchers benefit from a highly scalable and reliable storage service that can handle large data sets and be accessed from anywhere. S3 also integrates with other AWS services, such as AWS Lambda and Amazon EMR, which can perform computational analyses and processing tasks on the stored data.

Amazon CloudWatch: This service can monitor the performance of genomics applications and services, set alarms, and automate actions based on predefined metrics.

AWS CodePipeline: This is a fully managed CI/CD service used to build, test, and deploy code changes. It can be used to deploy updates to genomics applications and services.

These tools allow researchers to store, manage, and access data sets quickly and easily, regardless of size.

Another significant feature of Amazon Omics is its machine learning and artificial intelligence (AI) capabilities. AWS provides several AI and machine learning solutions, such as Amazon SageMaker, Amazon Comprehend Medical, and Amazon Rekognition, which can be used to analyze and extract insights from genomic data. For example, researchers can use machine learning to identify patterns and anomalies in genomic data sets that could lead to the discovery of new treatments for diseases.

Components of Amazon Omics

The three key components of the Amazon Omics console are Omics Storage, Omics Analytics, and Omics Workflows.

Amazon Omics Storage:

With Omics Storage, you can create reference and sequence stores and import reference genomes and genomics files into them. After creating stores and importing genomic data, you can access and examine your sequence data.

On the Amazon Omics console, choose Get started with Omics, and then choose Reference genomes from the Genomics data storage options.

Choose a previously imported reference genome or import a new one. If you have not imported a reference genome, select Import reference genome in the top right.

On the Create reference genome import job page, choose either the Quick create or Manual create option to create a reference store, and then provide the following information.

Reference genome name – Enter a unique name for the store.
Description (optional) – Enter a description of the reference store.
Reference from Amazon S3 – Select your reference sequence file, in FASTA format, from an Amazon S3 bucket.
IAM role – Create a role with access to the reference genome.
Tags (optional) – Add tags, and then choose Import reference genome.
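
The same fields can also be collected in code before submitting an import job. The following is a minimal sketch, not the definitive way to script this step: the helper `reference_import_request` is hypothetical, the store ID, role ARN, and bucket are placeholders, and the field names are assumed to match the boto3 `omics` client's `start_reference_import_job` request.

```python
def reference_import_request(store_id, role_arn, name, s3_uri, description="", tags=None):
    """Build keyword arguments for a reference genome import job (hypothetical helper)."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("the reference file must be given as an s3:// URI")
    source = {"sourceFile": s3_uri, "name": name}
    if description:
        source["description"] = description
    if tags:
        source["tags"] = dict(tags)
    return {"referenceStoreId": store_id, "roleArn": role_arn, "sources": [source]}


# All identifiers below are placeholders, not real resources.
request = reference_import_request(
    store_id="1234567890",
    role_arn="arn:aws:iam::111122223333:role/OmicsImportRole",
    name="GRCh38",
    s3_uri="s3://example-bucket/references/GRCh38.fasta",
    tags={"project": "demo"},
)
print(request)
# With AWS credentials configured, the dictionary could be passed on as
# boto3.client("omics").start_reference_import_job(**request)  (assumed call shape).
```

Building the request as plain data first makes it easy to review or log before any job is started.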

After importing the reference genome, create a sequence store. Give it a unique name and choose whether its data is encrypted with a key owned and managed by AWS or with a customer managed key (CMK).

2. Sequence Store

A sequence store is a data store that holds genome sequence files.

On the Sequence store page, choose Import genomic files.
On the Specify import details page, provide the following information.
IAM role – The role that can access the genomic files on S3.
Reference genome – The reference genome for the genomics data.
On the Specify import manifest page, specify the manifest file. The manifest file is a JSON or YAML file that describes essential information about the genomics data. Then choose Create import job.
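
To illustrate what such a manifest might describe, here is a rough sketch that assembles one entry as JSON. The field names (`sourceFiles`, `sourceFileType`, `subjectId`, `sampleId`, `referenceArn`) are assumptions borrowed from the `StartReadSetImportJob` API request; the exact manifest layout the console expects may differ, and the helper, ARN, and S3 paths are hypothetical.

```python
import json

# Assumed set of accepted read file types.
ALLOWED_FILE_TYPES = {"FASTQ", "BAM", "CRAM"}


def read_set_source(file_type, source1, reference_arn, subject_id, sample_id, source2=None):
    """Describe one set of sequence files to import (hypothetical helper)."""
    if file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type}")
    files = {"source1": source1}
    if source2:  # second file of a paired-end FASTQ, if any
        files["source2"] = source2
    return {
        "sourceFiles": files,
        "sourceFileType": file_type,
        "subjectId": subject_id,
        "sampleId": sample_id,
        "referenceArn": reference_arn,
    }


manifest = {
    "sources": [
        read_set_source(
            "FASTQ",
            "s3://example-bucket/reads/sample1_R1.fastq.gz",
            # Placeholder ARN for an imported reference genome.
            "arn:aws:omics:us-east-1:111122223333:referenceStore/1234567890/reference/0987654321",
            subject_id="subject-1",
            sample_id="sample-1",
            source2="s3://example-bucket/reads/sample1_R2.fastq.gz",
        )
    ]
}
print(json.dumps(manifest, indent=2))
```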

Omics Analytics:

With Omics Analytics, you can create variant and annotation stores and then import data into them.

1. Create a variant store

A variant store is a data store that holds your genome variant files.

On the Amazon Omics home page, choose Variant stores under Analytics in the upper left corner of the screen.

On the Create variant store page, provide the following information.

Variant store name – Enter a unique name for this store.
Description (optional) – Enter a description of this variant store.
Reference genome – The reference genome for this variant store.
Data encryption – Choose whether the encryption key is owned and managed by AWS or by you, and then choose Create variant store.
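
As a sketch of how these console fields could map onto a scripted request, the hypothetical helper below assembles the arguments for a variant store. It assumes the boto3 `omics` client's `create_variant_store` accepts `name`, `reference`, `description`, and `sseConfig`, and that leaving out `sseConfig` selects an AWS owned key; the ARNs are placeholders.

```python
def variant_store_request(name, reference_arn, description="", kms_key_arn=None):
    """Build keyword arguments for creating a variant store (hypothetical helper)."""
    request = {"name": name, "reference": {"referenceArn": reference_arn}}
    if description:
        request["description"] = description
    if kms_key_arn:
        # Customer managed key; omitting sseConfig is assumed to mean an AWS owned key.
        request["sseConfig"] = {"type": "KMS", "keyArn": kms_key_arn}
    return request


# Placeholder reference genome ARN.
REFERENCE_ARN = "arn:aws:omics:us-east-1:111122223333:referenceStore/1234567890/reference/0987654321"

aws_owned = variant_store_request("my_variant_store", REFERENCE_ARN)
customer_managed = variant_store_request(
    "my_variant_store",
    REFERENCE_ARN,
    description="Variants called against GRCh38",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
print(aws_owned)
print(customer_managed)
# Assumed call shape: boto3.client("omics").create_variant_store(**aws_owned)
```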

After creating a variant store, you can create an annotation store.

2. Create an annotation store.

An annotation store is a data store that holds your genomic annotation files. It supports TSV/CSV, VCF, and GFF files.

On the Create annotation store page, provide the following information

Annotation store name – Enter a unique name for this store.
Description (optional) – Enter a description of this annotation store.
Data format and schema details – Select the data file format and upload the schema definition for this store.
Reference genome – The reference genome for this annotation.
Data encryption – Choose whether the encryption key is owned and managed by AWS or by you, and then choose Create annotation store.
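
Choosing the data format can be automated from the annotation file name. The sketch below is illustrative only: the helper `store_format_for` and its suffix table are hypothetical, and it assumes that the store format values are `TSV`, `VCF`, and `GFF`, with CSV files handled under the TSV format.

```python
from pathlib import PurePosixPath

# Assumed mapping from file suffix to annotation store format.
FORMAT_BY_SUFFIX = {".tsv": "TSV", ".csv": "TSV", ".vcf": "VCF", ".gff": "GFF", ".gff3": "GFF"}


def store_format_for(filename):
    """Pick an annotation store format from a file name (hypothetical helper)."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    if suffixes and suffixes[-1] == ".gz":  # look past gzip compression
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in FORMAT_BY_SUFFIX:
        raise ValueError(f"no supported annotation format for {filename!r}")
    return FORMAT_BY_SUFFIX[suffixes[-1]]


for example in ["clinvar.vcf.gz", "genes.GFF3", "scores.csv"]:
    print(example, "->", store_format_for(example))
```

Rejecting unknown suffixes early avoids creating a store whose format does not match the files that will be imported into it.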

Omics Workflows:

A workflow definition describes an entire analysis process, including its steps and the tools each step uses. Workflow definitions can be written in WDL or Nextflow.

1. Create a workflow

On the Amazon Omics home page, select Workflows in the upper left corner of the screen.
On the Create workflow page, provide the following information.
1. Workflow main definition path – The path to the main workflow definition file.
2. Workflow name – Enter a distinctive name for this workflow.
3. Run storage capacity (optional) – The default amount of storage needed for this workflow. The default is 1.2 TB. This storage is deleted after the run completes.
4. Workflow definition – The Amazon S3 path to the workflow definition zip. Choose whether it is written in Nextflow or WDL from the drop-down box.
Choose Next.
On the Add workflow parameters page, provide the workflow parameters. You can either upload a JSON file that specifies the parameters or manually enter your workflow parameters.
Choose Create workflow.
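
The workflow fields above can be sketched as a request built in code. This is a hedged illustration: `workflow_request` is a hypothetical helper, the field names are assumed to match the boto3 `omics` client's `create_workflow`, and the 1.2 TB default is expressed as 1200 GiB on the assumption that the API measures storage in gibibytes.

```python
ENGINES = {"WDL", "NEXTFLOW"}
DEFAULT_STORAGE_GIB = 1200  # the 1.2 TB default, assumed to be given in GiB


def workflow_request(name, engine, definition_uri, main, parameters=None,
                     storage_gib=DEFAULT_STORAGE_GIB):
    """Build keyword arguments for creating a workflow (hypothetical helper)."""
    engine = engine.upper()
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {sorted(ENGINES)}")
    if not (definition_uri.startswith("s3://") and definition_uri.endswith(".zip")):
        raise ValueError("the workflow definition must be an s3:// path to a zip file")
    request = {
        "name": name,
        "engine": engine,
        "definitionUri": definition_uri,
        "main": main,
        "storageCapacity": storage_gib,
    }
    if parameters:
        # One template entry per workflow parameter, keyed by parameter name.
        request["parameterTemplate"] = {
            key: {"description": text} for key, text in parameters.items()
        }
    return request


request = workflow_request(
    name="variant-calling",
    engine="wdl",
    definition_uri="s3://example-bucket/workflows/variant-calling.zip",  # placeholder
    main="main.wdl",
    parameters={"input_fastq": "S3 URI of the reads to process"},
)
print(request)
# Assumed call shape: boto3.client("omics").create_workflow(**request)
```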

2. Start a run

A run is a single invocation of a workflow. For each run, you can monitor its status and the vCPUs and memory it requested, and access its logs through CloudWatch.

On the Create run page, provide the following information.

Workflow ID – The workflow ID associated with this run.
Run name – A distinctive name for this run.
IAM role – The IAM role that can access the data locations referenced in your parameter values. It should also contain a CloudWatch policy so that the service can publish logs to your CloudWatch account.
Run priority – The priority of this run. Higher numbers specify a higher priority, and the highest priority tasks are run first.
Run storage capacity – The amount of temporary storage needed for the run. By default, the run storage capacity set for the workflow will be selected. You can select a different run storage capacity for your run.
Select S3 output destination – The S3 location where the run outputs will be saved. Then choose Next.
On the Add parameter values page, provide the workflow parameters. You can either upload a JSON file that specifies the parameters or manually enter your workflow parameters. Then choose Next.

Provide the run group details on the Add run groups and tags page. Optionally, you can provide up to 50 tags for this run.
Choose Create run.
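
The run settings above, including the JSON parameter file and the 50-tag limit, can be sketched as follows. The helper `run_request` is hypothetical, its field names are assumed to match the boto3 `omics` client's `start_run`, and all IDs, ARNs, and S3 paths are placeholders.

```python
import json

MAX_TAGS = 50  # the article's limit on tags per run


def run_request(workflow_id, role_arn, name, output_uri, parameters_json,
                priority=0, storage_gib=None, run_group_id=None, tags=None):
    """Build keyword arguments for starting a run (hypothetical helper)."""
    parameters = json.loads(parameters_json)
    if not isinstance(parameters, dict):
        raise ValueError("workflow parameters must be a JSON object")
    tags = dict(tags or {})
    if len(tags) > MAX_TAGS:
        raise ValueError(f"a run can have at most {MAX_TAGS} tags")
    request = {
        "workflowId": workflow_id,
        "roleArn": role_arn,
        "name": name,
        "outputUri": output_uri,
        "parameters": parameters,
        "priority": priority,  # higher numbers run first
    }
    if storage_gib is not None:  # otherwise the workflow's default applies
        request["storageCapacity"] = storage_gib
    if run_group_id:
        request["runGroupId"] = run_group_id
    if tags:
        request["tags"] = tags
    return request


request = run_request(
    workflow_id="1234567",
    role_arn="arn:aws:iam::111122223333:role/OmicsRunRole",
    name="sample-1-run",
    output_uri="s3://example-bucket/outputs/",
    parameters_json='{"input_fastq": "s3://example-bucket/reads/sample1_R1.fastq.gz"}',
    priority=10,
)
print(request)
# Assumed call shape: boto3.client("omics").start_run(**request)
```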

3. Create a run group

A run group sets limits for the runs it contains: the maximum number of vCPUs used concurrently across the group and the maximum duration of each run.

Click on Create run group.
On the Create run group details page, provide the following information.
1. Run group name – Enter a name for this run group.
2. Max vCPU for concurrent runs – The maximum number of vCPUs running in parallel across the run group.
3. Max run time (hrs) per run – The maximum time a run can be active.
Click Create run group.
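
One detail to watch when scripting this step is units. The console asks for the maximum run time in hours, while the `create_run_group` API is assumed here to take `maxDuration` in minutes. A minimal sketch with a hypothetical helper:

```python
def run_group_request(name, max_vcpus, max_hours):
    """Build keyword arguments for creating a run group (hypothetical helper)."""
    if max_vcpus < 1:
        raise ValueError("max_vcpus must be at least 1")
    if max_hours <= 0:
        raise ValueError("max_hours must be positive")
    return {
        "name": name,
        "maxCpus": max_vcpus,
        # Hours from the console form, converted to minutes (assumed API unit).
        "maxDuration": int(max_hours * 60),
    }


request = run_group_request("research-team", max_vcpus=64, max_hours=12)
print(request)
# Assumed call shape: boto3.client("omics").create_run_group(**request)
```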

Conclusion

Amazon Omics provides a powerful suite of tools for analyzing and interpreting genomic data. Its storage, analytics, and workflow components let researchers investigate many aspects of genomics research.

Overall, Amazon Omics is a valuable suite of tools and services for researchers and scientists working with genomic data. Its scalability, machine learning capabilities, and visualization tools make it a powerful solution for storing, analyzing, and gaining insights from large-scale genomic data sets. Moreover, it provides researchers and scientists with a secure, compliant, and flexible platform.

FAQs

1. What types of genomics research can be performed using Amazon Omics?

ANS: – Amazon Omics can be used for various genomics research, including genome sequencing, transcriptomics, epigenomics, and metagenomics. It can also be used for functional genomics, such as gene expression analysis and gene editing.

2. Can we use our own data with Amazon Omics?

ANS: – Yes, users can upload their data to Amazon Omics using the Amazon S3 service. Users can also use AWS Glue to integrate data from different sources, such as public genomics databases or other AWS services.

WRITTEN BY Abhilasha D

Abhilasha D works as a Research Associate-DevOps at CloudThat. She is focused on gaining knowledge of the cloud environment and DevOps tools. Abhilasha is interested in learning and researching emerging technologies and is skilled in dealing with problems in a resourceful manner.