Disaster Recovery (DR) is a critical aspect of any cloud deployment. In the words of Amazon's CTO Werner Vogels, "Everything fails, all the time." An entire data center, or even an entire region, of a cloud provider can go down. This has already happened to major providers such as Amazon AWS and Microsoft Azure, and it will surely happen again. Providers like AWS and Azure will readily suggest a Disaster Recovery and Business Continuity strategy that spans multiple regions, so that if one geographic region goes down you can continue operating from another. This sounds good in theory, but relying on multiple regions of a single provider has several flaws. AWS and Azure regions are not fully independent of each other, which makes a global, provider-wide failure scenario possible. Data is also better protected from accidental deletion when it is stored in multiple clouds and, more importantly, better protected from malicious deletion. The client wanted to make sure that no single person could delete all copies of important data, which was possible if a single cloud was used.
The company's CEO was very worried about a past incident in which a disgruntled employee deleted an AWS RDS database along with its backup copies. Luckily, most of the data was recovered from an ad-hoc copy that someone had made. Management therefore wanted to guarantee that no single person could delete all of the data. They also wanted business continuity and disaster recovery in case AWS went down.
The business also did not want these changes to add substantial cost. This led to the following objectives:
1. Moving data regularly from one cloud provider (AWS) to another (Azure) requires a lot of automation. Doing this manually each time would consume substantial human time, and therefore cost, so one of the objectives was to automate as much as possible.
2. Because not increasing costs substantially was one of the criteria, we wanted to keep as little as possible running in Azure at any given time.
3. We wanted a warm DR setup in which most components could be started from scratch as and when needed.
We also wanted to use as many cloud-native services as possible. For example, since the AWS side used S3 for storage, we wanted to avoid provisioning disks on Azure and instead use the Azure Blob service.
Architecture Diagram and Designs
The existing architecture of the system looked like the diagram below. The client used Route 53 to route DNS, let's say www.sample.com, to an Elastic Load Balancer (ELB), which in turn fronted EC2 instances running with auto-scaling. The database layer was the Relational Database Service (RDS) running MySQL in Multi-AZ mode. Some very important data was stored in an S3 bucket and distributed to clients through a CloudFront distribution.
We first set out to automatically replicate the RDS MySQL DB in Azure, as that was considered the most crucial piece. We designed an automated process as below:
1. Daily, an AWS Simple Workflow Service (SWF) workflow automatically creates a Read Replica of the database.
2. SWF then takes a dump of the databases on that RDS instance.
3. On the Azure side, it starts a stopped Azure VM that has MySQL running on it.
4. The dump is copied to Azure Storage and from there loaded into MySQL on the Azure VM.
5. SWF then releases the Read Replica and stops the Azure VM to save costs.
6. If RDS goes down, a site-to-site VPN tunnel can be established between the Azure VNET and the AWS VPC, and the app servers running on AWS can use the MySQL instance running on the Azure VM.
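The daily cycle above can be sketched as a single orchestration function. This is a minimal illustration, not the production workflow: the `rds` and `azure` objects stand in for thin wrappers over boto3 and the Azure SDK, and all of the method and resource names (`create_read_replica`, `dr-mysql-vm`, etc.) are hypothetical; in the real setup each step is an SWF activity.

```python
from datetime import date

def run_daily_replication(rds, azure):
    """One daily DB replication cycle (sketch; all step names are hypothetical).

    `rds` and `azure` are injected wrappers around the real SDK calls, so the
    orchestration logic can be exercised without cloud credentials.
    """
    replica_id = f"dr-replica-{date.today():%Y%m%d}"
    rds.create_read_replica(replica_id)    # 1. replica off the primary DB
    dump = rds.dump_databases(replica_id)  # 2. mysqldump taken from the replica
    azure.start_vm("dr-mysql-vm")          # 3. wake the stopped MySQL VM
    azure.upload_to_storage(dump)          # 4a. stage the dump in Azure Storage
    azure.restore_mysql(dump)              # 4b. load it into MySQL on the VM
    rds.delete_read_replica(replica_id)    # 5. release the replica ...
    azure.stop_vm("dr-mysql-vm")           #    ... and stop the VM to save costs
```

Injecting the two clients keeps the failure-handling and sequencing testable locally, which matters for a DR path that only runs for real during an outage.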
Then we set out to replicate the data that is present on S3 to Azure Blob Storage.
1. We set up S3 bucket event notifications to fire when a new object is added to our bucket, and when an existing object is changed or deleted.
2. These notifications are configured to add an entry to a Simple Queue Service (SQS) queue.
3. A Sync Agent running on an EC2 instance dequeues messages from the SQS queue and, for each one, copies the affected object from the S3 bucket to an Azure Blob Storage container.
4. This Blob Storage container is fronted by Azure CDN.
5. If S3 or CloudFront goes down, the application running on AWS can switch from the CloudFront endpoint to the Azure CDN endpoint.
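The core of the Sync Agent is the per-message handler. The sketch below parses the standard S3 event notification structure (a `Records` array whose entries carry an `eventName` and the object key) and dispatches to copy or delete; the `blob_client` wrapper and its `copy_from_s3`/`delete` methods are hypothetical stand-ins for the Azure Blob SDK calls.

```python
import json

def handle_s3_event(message_body, blob_client):
    """Process one SQS message carrying an S3 event notification (sketch).

    `blob_client` is an assumed thin wrapper over the Azure Blob SDK.
    """
    event = json.loads(message_body)
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        name = record["eventName"]
        if name.startswith("ObjectCreated"):
            # New or changed object: mirror it into the Blob container.
            blob_client.copy_from_s3(key)
        elif name.startswith("ObjectRemoved"):
            # Deletion in S3: remove the matching blob.
            blob_client.delete(key)
```

Matching on the `ObjectCreated`/`ObjectRemoved` prefixes covers the Put, Post, Copy, and Delete variants of the event names without enumerating each one.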
The third thing we automated was moving the Amazon Machine Images (AMIs) of applications running on AWS to Azure.
1. An AWS SWF process is triggered every time a new AMI is created in the AWS account.
2. This SWF process uses the AWS VM Import/Export service to convert the AMI to VHD format.
3. The resulting VHD is then copied to Azure Blob Storage.
4. Tools are then used to convert the uploaded VHD into a bootable Azure image, which can be used to start an Azure VM.
We were able to achieve automatic replication of every aspect of the infrastructure from AWS to Azure. If individual services such as S3, RDS, or CloudFront were down, the system could substitute the corresponding Azure services for them. Even if all of AWS were down, an Azure ARM template could bring the entire system up in Azure within a few minutes with minimal human intervention. We achieved substantial DR capabilities without increasing the monthly costs much.