What is a data lake?
A data lake is a centralized repository that allows you to store all structured and unstructured data at scale and run flexible analytics, such as dashboards, visualizations, big data processing, real-time analytics, and machine learning, to guide better decisions.
Source: Amazon Web Services.
Data lake on AWS
Amazon Web Services (AWS) offers several services that can be used to build a data lake. They include:
- Amazon S3: A highly scalable object storage service that can be used to store all your data.
- Amazon EMR: A managed Hadoop and Spark service that can be used to process data in a data lake.
- Amazon Athena: A serverless query service that can be used to analyse data in a data lake.
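- Amazon Redshift Spectrum: A feature of Amazon Redshift, the fully managed, petabyte-scale data warehouse, that can be used to query and analyse data in a data lake on Amazon S3 without first loading it into the warehouse.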
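As a rough illustration of how data might be laid out in Amazon S3 so that query services such as Athena can read only the relevant objects, the sketch below builds Hive-style partitioned object keys (`key=value` path segments). The `raw/` prefix, the dataset name, and the file name are hypothetical choices made for this example, not something AWS requires.

```python
from datetime import date


def partitioned_key(dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key.

    The "raw/" prefix and the year/month/day partition columns are
    assumptions for this sketch; pick whatever layout suits your data.
    """
    return (
        f"raw/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Example: the key under which one day's file of trade data could be stored.
example_key = partitioned_key("trades", date(2024, 3, 7), "part-0001.parquet")
print(example_key)
```

Keeping the partition values in the key lets a query that filters on a date range touch only the matching prefixes instead of scanning the whole bucket.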
Industry-leading financial institutions
Mastercard acquired NuData Security to improve its fraud prevention techniques by using passive biometrics to authenticate account holders’ identities. NuData uses an Amazon S3 data lake to store customer data that it collects and analyzes in real time. By using AWS, NuData is able to aggregate, anonymize, and analyze petabytes of customer data to detect anomalous behavior patterns and protect customers from fraud.
Capital One wanted to leverage machine learning capabilities to provide better fraud detection services for its customers. The bank chose to build a data lake on Amazon S3, enabling it to store and analyse large volumes of data. Using Amazon S3 means the bank is better able to detect and prevent fraud in real time. When suspicious activity occurs, Capital One automatically alerts customers and walks them through how to report instances of fraud.
National Australia Bank (NAB) built its Data Hub data lake to power “Discovery Cloud,” a laboratory for the bank’s data scientists. By building its data lake on AWS, NAB is able to provide full data lineage, access the data in real time via APIs, and analyse the data using a wide range of AWS or third-party services.
Nasdaq needed to provide greater accessibility to data for both internal users and regulators. By building a data lake on AWS, Nasdaq is able to move an average of 30 billion rows into the cloud every day (with 60 billion on a peak day), while fulfilling security and regulatory requirements and realizing cost efficiencies.
FINRA Case Study
FINRA (Financial Industry Regulatory Authority) leverages AWS (Amazon Web Services) to build and manage its data lake, a central repository for storing and analysing vast amounts of trade data. This data lake enables FINRA’s analysts to efficiently investigate potential fraud, market manipulation, and insider trading.
Key aspects of FINRA’s AWS data lake:
Data Storage:
FINRA utilizes Amazon S3 (Simple Storage Service) for storing raw, unstructured data in its data lake, allowing for scalability and flexibility.
Data Cataloging and Transformation:
AWS Glue is used for data cataloging, metadata management, and ETL (Extract, Transform, Load) processes, ensuring data quality and consistency.
Data Analysis:
Amazon Athena, a serverless query engine, allows analysts to perform SQL queries directly on the data lake, facilitating efficient data exploration and discovery.
Benefits:
The data lake enables FINRA to analyze years of historical market data quickly, identify potential violations, and support regulatory oversight effectively.
Scalability and Security:
AWS services provide the scalability and security infrastructure necessary to handle the massive volume of data FINRA processes, according to a blog post on Amazon Web Services.
Data Access:
According to a FINRA document, access to some datasets requires signing a user agreement, while others are immediately accessible. Access to firm data is controlled by the firm, and users should contact their firm’s account administrator for access.
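To make the Athena step above more concrete, here is a minimal sketch of how an analyst's SQL query could be submitted with boto3. It does not reflect FINRA's actual schema or code: the `market_data` database, the `trades` table, the `symbol` and `trade_date` columns, and the results bucket are all hypothetical names chosen for illustration.

```python
from datetime import date


def build_athena_request(database: str, table: str, trade_date: str,
                         output_location: str) -> dict:
    """Build keyword arguments for the Athena StartQueryExecution call.

    All names passed in are hypothetical. The date and table name are
    validated before being placed in the SQL text.
    """
    day = date.fromisoformat(trade_date)  # raises ValueError on bad input
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    query = (
        "SELECT symbol, COUNT(*) AS trade_count "
        f'FROM "{table}" '
        f"WHERE trade_date = DATE '{day.isoformat()}' "
        "GROUP BY symbol ORDER BY trade_count DESC LIMIT 10"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above works without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


request = build_athena_request(
    database="market_data",                          # hypothetical Glue database
    table="trades",                                  # hypothetical table
    trade_date="2024-03-07",
    output_location="s3://example-athena-results/",  # hypothetical bucket
)
print(request["QueryString"])
```

Calling `run_query(request)` would return a query execution ID that can be polled for results. The table itself would first need to be registered in the AWS Glue Data Catalog, for example by a Glue crawler.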
Data Lakes and Analytics on AWS
Analytics category | AWS service |
Streaming | Amazon Data Firehose, Amazon Kinesis, Amazon Managed Service for Apache Flink, Amazon MSK |
Data lakehouse, data warehouse, data lake | SageMaker Lakehouse, Amazon Redshift, Amazon S3 data lake |
Data processing | Athena, Amazon EMR, AWS Glue, Amazon Managed Workflows for Apache Airflow (Amazon MWAA) |
Business intelligence | QuickSight |
Search analytics | OpenSearch Service |
Data and AI governance | Amazon DataZone, SageMaker Catalog |
Statistics
Amazon Redshift delivers 3x better price-performance than other cloud data warehouses.
Amazon EMR delivers 3.9x better performance than open-source Apache Spark.
OpenSearch Service processes trillions of requests per month.
Hundreds of millions of data integration jobs run on AWS Glue every month.
Conclusion
In this blog, we introduced data lakes on AWS and looked at the services and components that make up data lake and analytics solutions on AWS. A key attraction of a data lake is the benefit it offers over existing solutions: all data is stored at scale in one place and can be analysed flexibly. We also looked at several success stories from financial institutions, along with published statistics that support them.
WRITTEN BY Vivek Kumar