What is a data lake?
A data lake is a centralized repository that allows you to store all structured and unstructured data at scale and run flexible analytics such as dashboards, visualizations, big data processing, real-time analytics, and machine learning, to guide better decisions.
Source: Amazon Web Services.
Data lake on AWS
Amazon Web Services (AWS) offers several services that can be used to build a data lake. They include:
- Amazon S3: A highly scalable object storage service that can be used to store all your data.
- Amazon EMR: A managed Hadoop and Spark service that can be used to process data in a data lake.
- Amazon Athena: A serverless query service that can be used to analyse data in a data lake.
- Amazon Redshift Spectrum: A feature of Amazon Redshift, the fully managed, petabyte-scale data warehouse, that lets you query data in an S3 data lake directly with SQL without first loading it into Redshift.
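As a rough illustration of how data might be laid out in an S3-based data lake, the sketch below builds Hive-style partitioned object keys, where partition values appear in the key as `name=value` segments so that query engines can skip partitions they do not need. The `raw/` prefix, the `trades` dataset name, and the `partitioned_key` helper are hypothetical choices for this example, not something the services above require.

```python
from datetime import date


def partitioned_key(dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key for one file.

    The "raw/" prefix and the year/month/day partition scheme are
    assumptions made for this sketch.
    """
    return (
        f"raw/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Hypothetical dataset and file name, used only for illustration.
key = partitioned_key("trades", date(2024, 3, 7), "part-0000.parquet")
print(key)  # raw/trades/year=2024/month=03/day=07/part-0000.parquet
```

Zero-padding the month and day keeps keys in chronological order when they are listed lexicographically.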
Industry-leading financial institutions
Mastercard acquired NuData Security to improve its fraud prevention techniques by using passive biometrics to authenticate account holders’ identities. NuData uses an Amazon S3 data lake to store customer data that it collects and analyzes in real time. By using AWS, NuData is able to aggregate, anonymize, and analyze petabytes of customer data to detect anomalous behavior patterns and protect customers from fraud.
Capital One wanted to leverage machine learning capabilities to provide better fraud detection services for its customers. The bank chose to build a data lake on Amazon S3, enabling it to store and analyse large volumes of data. Using Amazon S3 means the bank is better able to detect and prevent fraud in real time. When suspicious activity occurs, Capital One automatically alerts customers and walks them through how to report instances of fraud.
National Australia Bank (NAB) built its Data Hub data lake to power “Discovery Cloud,” a laboratory for the bank’s data scientists. By building its data lake on AWS, NAB is able to provide full data lineage, access the data in real-time via APIs, and analyse the data using a wide range of AWS or third-party services.
Nasdaq needed to provide greater accessibility to data for both internal users and regulators. By building a data lake on AWS, Nasdaq is able to move an average of 30 billion rows into the cloud every day (with 60 billion on a peak day), while fulfilling security and regulatory requirements and realizing cost efficiencies.
FINRA Case Study
FINRA (Financial Industry Regulatory Authority) uses AWS to build and manage its data lake, a central repository for storing and analysing vast amounts of trade data. This data lake enables FINRA’s analysts to efficiently investigate potential fraud, market manipulation, and insider trading.
Key aspects of FINRA’s AWS data lake:
Data Storage:
FINRA utilizes Amazon S3 (Simple Storage Service) for storing raw, unstructured data in its data lake, allowing for scalability and flexibility.
Data Cataloging and Transformation:
AWS Glue is used for data cataloging, metadata management, and ETL (Extract, Transform, Load) processes, ensuring data quality and consistency.
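To make the cataloging step more concrete, here is a minimal sketch of the kind of table definition a Glue Data Catalog entry holds for partitioned Parquet files in S3. It is not FINRA's actual schema: the `trades` table, its columns, the `raw/trades/` location, and the `trades_table_input` helper are all hypothetical. The dict follows the shape of the `TableInput` argument accepted by the Glue `create_table` API.

```python
def trades_table_input(bucket: str) -> dict:
    """Return a TableInput-shaped dict for a partitioned Parquet table.

    Table name, columns, and S3 layout are assumptions for this sketch.
    """
    return {
        "Name": "trades",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        # Partition columns mirror year=/month=/day= segments in the S3 keys.
        "PartitionKeys": [
            {"Name": "year", "Type": "string"},
            {"Name": "month", "Type": "string"},
            {"Name": "day", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "trade_id", "Type": "string"},
                {"Name": "symbol", "Type": "string"},
                {"Name": "price", "Type": "double"},
                {"Name": "quantity", "Type": "bigint"},
            ],
            "Location": f"s3://{bucket}/raw/trades/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


table_input = trades_table_input("example-bucket")  # hypothetical bucket name
print(table_input["StorageDescriptor"]["Location"])
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("glue").create_table(DatabaseName="market_data", TableInput=table_input)`, where `market_data` is again a hypothetical database name.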
Data Analysis:
Amazon Athena, a serverless query engine, allows analysts to perform SQL queries directly on the data lake, facilitating efficient data exploration and discovery.
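As an illustration of what such a query might look like, the sketch below assembles the request parameters for Athena's `start_query_execution` API. The `trades` table, its columns, the `market_data` database, the results bucket, and the `athena_query_request` helper are hypothetical and do not reflect FINRA's actual data model.

```python
def athena_query_request(database: str, output_location: str,
                         year: str, month: str) -> dict:
    """Build keyword arguments for Athena's start_query_execution call.

    The table and column names in the SQL are assumptions for this sketch.
    """
    # Accept only digits so the values are safe to embed in the SQL text.
    if not (year.isdigit() and month.isdigit()):
        raise ValueError("year and month must contain digits only")

    query = (
        "SELECT symbol, COUNT(*) AS trade_count, "
        "SUM(price * quantity) AS notional "
        "FROM trades "
        f"WHERE year = '{year}' AND month = '{month}' "
        "GROUP BY symbol ORDER BY notional DESC LIMIT 10"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_query_request(
    "market_data", "s3://example-bucket/athena-results/", "2024", "03"
)
print(request["QueryString"])
```

With boto3 installed and credentials configured, the request could be submitted with `boto3.client("athena").start_query_execution(**request)`. Filtering on the partition columns (`year`, `month`) lets Athena read only the matching S3 prefixes rather than the whole table.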
Benefits:
The data lake enables FINRA to analyze years of historical market data quickly, identify potential violations, and support regulatory oversight effectively.
Scalability and Security:
According to an AWS blog post, AWS services provide the scalability and security infrastructure needed to handle the massive volume of data FINRA processes.
Data Access:
According to a FINRA document, some datasets require signing a user agreement before access is granted, while others are immediately accessible. Access to firm data is controlled by the firm, so users should contact their firm’s account administrator.
Data Lakes and Analytics on AWS
Analytics category | AWS service |
Streaming | Amazon Data Firehose, Amazon Kinesis, Amazon Managed Service for Apache Flink, Amazon MSK |
Data lakehouse, Data warehouse, Data lake | SageMaker Lakehouse, Amazon Redshift, Amazon S3 data lake |
Data Processing | Athena, Amazon EMR, AWS Glue, Amazon Managed Workflows for Apache Airflow (Amazon MWAA) |
Business intelligence | QuickSight |
Search analytics | OpenSearch Service |
Data and AI governance | Amazon DataZone, SageMaker Catalog |
Statistics
Amazon Redshift delivers 3x better price-performance than other cloud data warehouses.
Amazon EMR delivers 3.9x better performance than open-source Apache Spark.
Trillions of requests are processed per month by OpenSearch Service.
Hundreds of millions of data integration jobs run on AWS Glue every month.
Conclusion
This blog introduced data lakes on AWS. We looked at the AWS services and components used to build data lake and analytics solutions, reviewed success stories from financial institutions, and cited statistics that support them. The key advantage of a data lake is that it brings structured and unstructured data together in one place for flexible analytics at scale.
WRITTEN BY Vivek Kumar