

HDFS: Distributed Big Data Storage and Management


Hadoop Distributed File System (HDFS) is a distributed file system that stores and manages large datasets across multiple machines.

HDFS stores data in a distributed manner by dividing it into blocks of fixed size and replicating each block across multiple DataNodes. By default, HDFS uses a block size of 128 MB, which can be configured based on the application requirements.
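Both the block size and the replication factor discussed next are set in the cluster configuration. A minimal sketch of an `hdfs-site.xml` fragment, using the standard `dfs.blocksize` and `dfs.replication` property names (the values shown are illustrative, not recommendations):

```xml
<configuration>
  <!-- Block size in bytes: 256 MB instead of the 128 MB default -->
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <!-- Number of replicas kept for each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```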

The replication factor determines the number of replicas created for each block. By default, HDFS uses a replication factor of 3, meaning each block is stored on 3 different DataNodes. HDFS follows a write-once-read-many model: once a file is written, it cannot be modified in place, although new data can be appended to it. This model ensures data consistency and makes managing data across multiple nodes easy.
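The arithmetic behind these two settings is simple. A small sketch, with illustrative sizes, of how many blocks a file occupies and how much raw cluster storage its replicas consume:

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, total raw storage consumed in MB)."""
    # A file is split into fixed-size blocks; only the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is replicated, so raw usage is file size x replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

blocks, raw = hdfs_storage(1000)   # a 1,000 MB file with the defaults
print(blocks)  # 8 blocks (7 full 128 MB blocks + 1 partial block)
print(raw)     # 3000 MB of raw storage across the cluster
```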

Distributed File System

A distributed file system is a file system that is distributed across multiple systems and connected by a network. These multiple systems or machines are also known as nodes. Data here is divided into smaller units known as blocks.

DFS allows users to access and manage data stored on remote nodes as if it were stored locally. It provides benefits over traditional centralized file systems, such as scalability, fault tolerance, and improved performance.

DFS serves the same purpose as the file system on a single machine: for Windows we have NTFS, and for Mac we have HFS. DFS differs from these in that it stores data across multiple nodes/machines, which is extremely helpful when dealing with very large datasets. On top of that, it presents the data to users as if it were stored locally.

Below are some reasons why we need distributed file systems:

  1. Scalability: A distributed file system allows data to be distributed across multiple machines, making it possible to store and manage large datasets that a single machine cannot handle. This enables organizations to scale their data storage and processing capabilities as their data grows.
  2. Fault-tolerance: In a distributed file system, data is replicated across multiple machines, which means that even if one or more nodes fail, data is still accessible and available to users. This makes distributed file systems more reliable and fault-tolerant than traditional centralized ones.
  3. Improved performance: By distributing data across multiple machines, a distributed file system can improve read and write performance by enabling parallel access to data. This means multiple users can access and manipulate data simultaneously without affecting the system’s overall performance.

After discussing the concepts of HDFS and DFS, we will now look at the components and architecture of HDFS. The architecture follows a master-slave model, wherein the NameNode acts as the master and the DataNodes act as slaves. HDFS is designed to run on commodity hardware while providing fault-tolerant, highly available storage for applications. Below is a brief overview of each component.

  1. NameNode: It manages and maintains the DataNodes, also known as slave nodes, and records the metadata of the files stored in the cluster. It regularly receives reports on the health of all the DataNodes in the cluster.
  2. DataNode: These are also regarded as slave machines. They are the only place where the actual data is stored, and they are responsible for serving clients’ read and write requests.
  3. Secondary NameNode: It is an important component that contacts the NameNode periodically and pulls a copy of the metadata from it. Checkpointing is performed by the Secondary NameNode.

The NameNode and DataNodes communicate using the HDFS protocol, which is based on Remote Procedure Calls (RPC). The NameNode communicates with the DataNodes to store, retrieve, and replicate data blocks, while the DataNodes periodically send heartbeats and block reports to the NameNode to report their health and block availability.
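The heartbeat mechanism can be sketched as follows. This is a simplified illustration, not the real NameNode implementation: the timeout value and the liveness rule are placeholders, whereas real HDFS uses configurable heartbeat and recheck intervals.

```python
class NameNode:
    """Toy model of a NameNode tracking DataNode liveness via heartbeats."""

    def __init__(self, timeout=30):
        self.timeout = timeout      # seconds without a heartbeat before a node is considered dead
        self.last_heartbeat = {}    # DataNode id -> time of its last heartbeat

    def receive_heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def live_datanodes(self, now):
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t <= self.timeout]

nn = NameNode()
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=0)
nn.receive_heartbeat("dn1", now=25)   # dn1 keeps reporting; dn2 goes silent
print(nn.live_datanodes(now=40))      # ['dn1'] -- dn2 missed its heartbeats
```

When a DataNode stops sending heartbeats, the NameNode treats its blocks as under-replicated and schedules new replicas on healthy nodes.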


Use Cases

Big data processing and storage applications, such as analytics, data warehousing, and machine learning, are some of HDFS’s prominent use cases.

  1. Log Processing: HDFS is used for processing log data generated by applications and devices. It allows huge volumes of log data to be stored cost-effectively for analysis.
  2. Data Warehousing: It is used as a storage layer for data warehousing applications, wherein large amounts of structured and unstructured data are analyzed.
  3. Machine Learning: HDFS is a data source for machine learning applications requiring large volumes of training data.

Features of HDFS

  1. Distributed Storage: Suppose we have a 10 TB file and access it through one machine in a cluster of 10 machines. It appears as if we have logged into a single machine with 10 TB or more of storage capacity, but the file is actually distributed over the ten machines of the cluster. Thus, storage is not bound by the physical limits of a single machine.
  2. Parallel Computation: Since the data is distributed across several machines, depending on the number of machines in the cluster, we can leverage parallel computation: N machines of comparable configuration work simultaneously on the same task.
  3. Data Integrity: HDFS constantly checks the integrity of stored data against its checksums.
  4. High Throughput: Because all the machines work in parallel, processing time can drop from X minutes to roughly X/N minutes, where N is the number of machines in the cluster. Thus, HDFS achieves high throughput.
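The distributed-storage feature above can be sketched with a toy placement scheme. Block placement here is simple round-robin for illustration; real HDFS placement also accounts for racks, replication, and free space.

```python
def place_blocks(num_blocks, machines):
    """Assign block indices to machines round-robin; return machine -> blocks."""
    placement = {m: [] for m in machines}
    for b in range(num_blocks):
        placement[machines[b % len(machines)]].append(b)
    return placement

# A 10 TB file with 128 MB blocks is 81,920 blocks; spread over 10 machines,
# each machine holds roughly 1 TB of the file.
machines = [f"node{i}" for i in range(10)]
placement = place_blocks(81920, machines)
print(len(placement["node0"]))  # 8192 blocks, about 1 TB on this machine
```

Because each machine holds only its share of the blocks, all ten can read and process their portions simultaneously, which is where the X/N speedup in the High Throughput point comes from.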


Through this blog, we tried to build an intuition for HDFS and its features. HDFS is a powerful and scalable distributed file system with a fault-tolerant, reliable architecture that ensures stored data is always available with minimal risk of data loss.


About CloudThat

CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.

Drop a query if you have any questions regarding HDFS, and I will get back to you quickly.

To get started, go through our Consultancy page and Managed Services Package, CloudThat’s offerings.


1. What are the other types of File Systems?

ANS: – Some common distributed file system implementations are:

  • Windows Distributed File System
  • Network File System (NFS)
  • Server Message Block (SMB)
  • Google File System (GFS)
  • Lustre
  • GlusterFS
  • MapR File System
  • Ceph

2. What is the role of the DataNode in HDFS?

ANS: – The DataNode is responsible for storing and retrieving the data from the file system.

3. What is the Secondary NameNode in HDFS?

ANS: – The Secondary NameNode is not a backup or failover NameNode; rather, it is responsible for performing periodic checkpoints of the NameNode’s namespace.

4. How does HDFS ensure fault tolerance?

ANS: – HDFS ensures fault tolerance by maintaining multiple replicas of each data block across multiple DataNodes.
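This can be sketched in a few lines: with a replication factor of 3, a block stays readable as long as at least one of its three DataNodes is up. The block and node names below are illustrative.

```python
def block_available(replica_nodes, failed_nodes):
    """A block is readable if any node holding a replica is still alive."""
    return any(n not in failed_nodes for n in replica_nodes)

replicas = {"blk_1": ["dn1", "dn2", "dn3"]}   # one block, three replicas
print(block_available(replicas["blk_1"], failed_nodes={"dn1"}))                  # True
print(block_available(replicas["blk_1"], failed_nodes={"dn1", "dn2", "dn3"}))    # False
```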

WRITTEN BY Parth Sharma



