Establishing secure connectivity between Azure and AWS and point-to-site connectivity to the AWS cloud

About Client

The client is an online gold loan company. They have taken the entire gold loan process online through a branchless model to make it more convenient. The client ensures that each customer can access gold loans that are fair, fast, and flexible.

Problem Statement

The client's many microservice applications required a robust, world-class infrastructure on the AWS cloud to deliver all gold loan-related services to customers in the simplest way possible. To achieve this, the client needed the applications hosted on AWS to be fault-tolerant and highly available (HA). They also needed proper monitoring and alerting for their infrastructure and applications. Because of poor architecture, the infrastructure cost three times what it should have. Even at that cost, many tasks were still manual.

Business Objectives

  1. Simplify the process of taking gold loans.
  2. Connect with users online through the branchless model to make the process more convenient.
  3. Provide online gold loans at the lowest interest rates in the market.
  4. Deliver a delightful customer experience for borrowers with a minimal processing fee, personalized support, and on-demand doorstep pickup.
  5. Provide complete support to the customer throughout the loan lifecycle.

Technical Objectives

  1. Deploy applications on highly available, fault-tolerant infrastructure.
  2. Implement simple, easily deployable automated deployment pipelines that make the development team independent of the DevOps team and make the process more reliable.
  3. Configure all internal applications in such a way that they are only accessible within the VPN network.
  4. Resolve the networking issues in the current infrastructure.
  5. Audit the current infrastructure and implement cost optimization.
  6. Implement an orchestration tool such as Kubernetes for cost and resource optimization.
  7. Implement a VPN setup to improve security and allow employees to work from home.
  8. Configure and maintain connectivity between the client and various banks using an appropriate VPN setup.
  9. Implement Prometheus and Grafana for better infrastructure monitoring.
  10. Implement New Relic for better monitoring of production infrastructure and applications.
  11. Implement SumoLogic for log aggregation of production resources.
  12. Implement a one-click solution to deploy infrastructure for the test, dev, and prod environments.
  13. Implement PagerDuty as the alerting system, with an on-call service integrated with Prometheus and New Relic.
  14. Manage the site-to-site VPN between on-premises and AWS within the organization.
  15. Convert all EC2-hosted microservice applications to containerized applications.
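
Objective 3 (internal applications reachable only from within the VPN network) comes down to checking a caller's source address against the VPN address range. A minimal Python sketch of that check follows. The CIDR 10.8.0.0/16 and the names VPN_CIDRS and is_vpn_client are assumptions made for illustration; the case study does not state the actual range or how the restriction was enforced.

```python
import ipaddress

# Hypothetical VPN address range; the real CIDR is not stated in this case study.
VPN_CIDRS = [ipaddress.ip_network("10.8.0.0/16")]


def is_vpn_client(source_ip: str) -> bool:
    """Return True if source_ip falls inside any configured VPN range."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # Malformed addresses are never treated as internal.
        return False
    return any(address in network for network in VPN_CIDRS)
```

In practice the same rule is more likely to be expressed as a security group or load balancer rule than as application code; the sketch only makes the logic of the check explicit.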

Design Factors

  1. Ensure a highly available, fault-tolerant production environment using Auto Scaling groups (ASGs) with different scaling policies across different Availability Zones.
  2. Automate verification of all security group rules using Lambda functions.
  3. Configure standard AMIs and enforce their use when launching instances. If a non-standard AMI is used, send a notification to Slack.
  4. Automate the process of installing Prometheus exporters, New Relic agent, and SumoLogic agents using Ansible playbooks.
  5. Automate Grafana dashboard backups and store them in S3.
  6. Configure all the production applications in private subnets with application load balancers in the public subnet.
  7. Configure production and non-production bastion hosts and allow access to respective servers.
  8. Use AWS Systems Manager Parameter Store for storing all the Jenkins parameters.
  9. Use AWS Secrets Manager for storing all the credentials, with restricted access.
  10. Manage all the hosted zones in Route 53, covering the separate domains for internal, external, production, and non-production purposes.
  11. Configure Jenkins pipelines for the Dev, UAT, Beta, and Prod environments.
  12. Customize the Slack notifications for Prometheus alerts and provide a link to the Grafana dashboard.
  13. Configure Alertmanager to send alerts to email, Slack, and PagerDuty.
  14. Configure dev endpoint monitoring in Prometheus Blackbox exporter and prod endpoint monitoring in New Relic Synthetics.
  15. Manage and integrate New Relic and PagerDuty for proper handling and routing of alerts.
  16. Configure APM, Synthetics, Browser, Mobile, and Infrastructure services in New Relic with alerting.
  17. Use Amazon Data Lifecycle Manager for EC2 snapshots.
  18. Configure SFTP and SES services for data flowing into AWS.
  19. Maintain Elastic Beanstalk and automate the deployment of environments for a couple of microservices.
  20. Automate aggregation of all the infrastructure and application logs to SumoLogic.
  21. Ensure that all AWS resources are tagged correctly with all the important keys and values.
  22. Analyse over-provisioned resources and take action towards cost optimization.
  23. Set up lifecycle policies for backups and snapshots.
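
As an illustration of design factor 2, the core of a security group check can be written as a pure function and called from a Lambda handler. The sketch below is not the client's actual script. It assumes a hypothetical policy (only ports 80 and 443 may be open to 0.0.0.0/0) and takes input shaped like the SecurityGroups list returned by the EC2 DescribeSecurityGroups API (for example via boto3's describe_security_groups()). The names PUBLIC_PORTS and find_open_rules are made up for this example.

```python
from typing import Dict, List

# Hypothetical policy: only these ports may be open to the whole internet.
PUBLIC_PORTS = {80, 443}
OPEN_CIDR = "0.0.0.0/0"


def find_open_rules(security_groups: List[Dict]) -> List[str]:
    """Return one finding per ingress rule that is open to the world
    on at least one port outside PUBLIC_PORTS."""
    findings = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            open_to_world = any(
                ip_range.get("CidrIp") == OPEN_CIDR
                for ip_range in rule.get("IpRanges", [])
            )
            if not open_to_world:
                continue
            from_port = rule.get("FromPort")
            to_port = rule.get("ToPort")
            # Rules that allow all protocols carry no port keys at all.
            if from_port is None or to_port is None:
                findings.append(f"{group['GroupId']}: all ports open to {OPEN_CIDR}")
                continue
            # Flag the rule if its port range includes any non-public port.
            if any(p not in PUBLIC_PORTS for p in range(from_port, to_port + 1)):
                findings.append(
                    f"{group['GroupId']}: ports {from_port}-{to_port} open to {OPEN_CIDR}"
                )
    return findings


# Sample input using the same field names as the EC2 API response.
sample_groups = [
    {
        "GroupId": "sg-1",
        "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        ],
    },
]
print(find_open_rules(sample_groups))  # ['sg-1: ports 22-22 open to 0.0.0.0/0']
```

Keeping the check free of AWS calls makes it easy to unit test. A Lambda handler would only need to fetch the groups and forward any findings, for instance to Slack.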

Amazon Services Used

  1. Amazon EC2 
  2. VPC 
  3. VPN 
  4. Elastic Load Balancer 
  5. Auto Scaling 
  6. Route 53 
  7. RDS 
  8. CloudFront
  9. Elastic File System
  10. Elastic Container Service
  11. IAM
  12. CloudWatch
  13. CloudTrail
  14. Elasticsearch Service
  15. Elastic Kubernetes Service
  16. Lambda
  17. Secrets Manager
  18. SES
  19. SFTP
  20. SQS
  21. SNS
  22. S3
  23. SSM
  24. WAF
  25. Compute Optimizer
  26. ElastiCache

Outcomes

We built a highly secure and robust infrastructure capable of handling massive traffic. We also provided an enhanced monitoring and alerting solution using Prometheus, Grafana, New Relic, SumoLogic, and PagerDuty. Together, these changes helped improve performance and reduced overall cost by 45%.

Lessons Learned

  1. Based on the problems the client faced with their old infrastructure, we took the necessary steps to resolve the issues. This gave us a good understanding of how to fix a poorly designed infrastructure while all the applications stayed up and running.
  2. We gained a good understanding of tools like Prometheus, Grafana, New Relic, SumoLogic, and PagerDuty.
  3. We also learnt how to work with a bank's remote, non-technical team while setting up a VPN connection with their network. Due to compliance issues, we didn't have access to their network and had to guide their team through the setup.