HDFS and Its Architecture

Sonu Lakra
2 min read · Jan 24, 2023

HDFS

  1. HDFS is the storage layer of Hadoop.
  2. Instead of relying on a single server, HDFS stores enormous amounts of data across several servers. The data is replicated across multiple hosts to achieve reliability.
  3. It is designed to run on low-cost commodity hardware.
  4. It makes data accessible to applications across the Hadoop cluster (see the client sketch after this list).
  5. It provides high throughput and fault tolerance.
  6. It follows a Master-Slave architecture.
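
To make the storage layer concrete, here is a minimal Java sketch that writes a file to HDFS through Hadoop’s FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are placeholders for illustration, not values from a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; point fs.defaultFS at your cluster's NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt"); // hypothetical path
            // The client asks the NameNode where to place the blocks, then
            // streams the bytes directly to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS!");
            }
        }
    }
}
```

Note that the client only consults the NameNode for metadata; the bytes themselves flow straight to the DataNodes, which is where HDFS gets its throughput.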

NameNode

  1. The NameNode is the central node of HDFS.
  2. The NameNode is also known as the Master.
  3. The NameNode keeps only the HDFS metadata.
  4. The actual data of the files is not kept on the NameNode.
  5. For any given file in HDFS, the NameNode knows the list of its blocks and their locations. Using this knowledge, the NameNode can reconstruct the file from its blocks (the sketch after this list queries exactly this information).
  6. Because the NameNode is so important to HDFS, the Hadoop/HDFS cluster becomes unreachable and is deemed down when the NameNode is down.
  7. In a Hadoop cluster, the NameNode is a single point of failure (see the High Availability section below).
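
The block metadata the NameNode holds can be inspected from any client. This hedged Java sketch asks for a file’s block list; the host name and path are again placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder host

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // Answered from the NameNode's in-memory metadata: one entry per
            // block, listing the DataNodes that hold its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
        }
    }
}
```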

DataNode

  1. The DataNode keeps the actual data of HDFS files.
  2. The DataNode is also known as the Slave.
  3. DataNodes and the NameNode are constantly communicating.
  4. When a DataNode first starts up, it announces itself to the NameNode along with the list of blocks it is in charge of.
  5. When a DataNode goes down, neither the cluster’s availability nor the data’s availability is affected, because the blocks it held are replicated on other DataNodes.
  6. DataNodes are typically provisioned with a large amount of hard disk space, because the DataNode is where the actual data is kept.
  7. Each DataNode notifies the NameNode with HEARTBEATS on a regular basis (the sketch after this list shows the timing settings involved).
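
As an illustration of the heartbeat mechanism, this small sketch reads the two standard HDFS configuration keys that govern it and derives the timeout after which the NameNode declares a DataNode dead. The property names are real HDFS keys; the expiry formula is the commonly cited default rule, shown here as an assumption rather than a guarantee for every version.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Values shown are the shipped defaults, since this sketch loads no hdfs-site.xml.
        // DataNodes heartbeat to the NameNode every 3 seconds by default.
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        // The NameNode rechecks for dead DataNodes every 5 minutes by default.
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);
        // Commonly cited expiry rule: no heartbeat for 2 * recheck + 10 * heartbeat
        // (~10.5 min by default) marks the DataNode dead, after which the NameNode
        // re-replicates the blocks it held.
        long expiryMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
        System.out.printf("heartbeat every %d s; DataNode declared dead after ~%.1f min%n",
                heartbeatSec, expiryMs / 60000.0);
    }
}
```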

Secondary NameNode

  1. Based on its name, we might presume that it is a backup node; however, it is not.
  2. The Secondary NameNode maintains checkpoints for the NameNode: at a predetermined interval, it reads the edit logs from the NameNode and applies them to its own copy of the fsimage. In this fashion, the fsimage file always reflects a recent HDFS state (the sketch below shows the settings that control the checkpoint interval).
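
For reference, these are the two standard HDFS settings that control when a checkpoint is taken. The property names are real HDFS configuration keys; the sketch simply prints their shipped defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // A checkpoint is triggered when either threshold is reached
        // (values shown are the shipped defaults):
        long periodSec = conf.getLong("dfs.namenode.checkpoint.period", 3600); // 1 hour
        long txns = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000);   // 1M edits
        System.out.printf("checkpoint every %d s, or after %d edit-log transactions%n",
                periodSec, txns);
    }
}
```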

How is the data actually stored in a DataNode?

  1. In Hadoop, HDFS divides each file into smaller parts known as data blocks.
  2. These blocks are stored as independent units.
  3. These HDFS data blocks have a default size of 128 MB.
  4. Hadoop distributes these blocks over several slave machines, and the master machine keeps the metadata pertaining to the placement of the blocks.
  5. A file’s blocks are all the same size, with the exception of the last one (if the file size is not a multiple of 128 MB), as the sketch after this list works through.
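
To see the arithmetic, here is a tiny sketch that splits a hypothetical 500 MB file at the default block size; the file size is an assumption chosen for illustration.

```java
public class BlockSplit {
    public static void main(String[] args) {
        final long BLOCK_SIZE = 128L * 1024 * 1024; // default dfs.blocksize: 128 MB
        long fileSize = 500L * 1024 * 1024;         // hypothetical 500 MB file

        long fullBlocks = fileSize / BLOCK_SIZE;    // 3 full 128 MB blocks
        long lastBlockMb = (fileSize % BLOCK_SIZE) / (1024 * 1024); // 116 MB remainder

        System.out.printf("%d full 128 MB blocks + 1 final block of %d MB%n",
                fullBlocks, lastBlockMb);
        // HDFS does not pad the final block: it occupies only 116 MB on disk,
        // not a full 128 MB.
    }
}
```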

How Hadoop Copes with DataNode Failure

  1. To achieve fault tolerance, the blocks are replicated across DataNodes.
  2. The replication factor is configurable, although the default is 3 (see the sketch after this list).
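
The replication factor can be set cluster-wide or per file. This hedged sketch shows both; the NameNode address and file path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder host
        // Cluster-wide default: every new block gets 3 replicas.
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // The factor can also be changed per file after the fact; the
            // NameNode then schedules extra copies or deletions of the
            // file's blocks to match the new factor.
            fs.setReplication(new Path("/user/demo/hello.txt"), (short) 2);
        }
    }
}
```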

HDFS High Availability

  1. Prior to Hadoop 2.0, the NameNode was the Single Point of Failure (SPOF) in an HDFS cluster.
  2. The availability of an HDFS cluster depends upon the availability of the NameNode.
  3. This kind of circumstance affected the HDFS cluster’s overall availability in two main ways:
  4. In the event of an unforeseen occurrence, such as a machine crash, the cluster would not become operational again until an operator restarted the NameNode.
  5. Planned maintenance, such as software or hardware upgrades on the NameNode machine, would result in windows of cluster downtime (a client-side HA configuration sketch follows this list).
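
Hadoop 2.0 addressed this by allowing two NameNodes to run as an active/standby pair, so clients can fail over transparently. The sketch below shows one way a client might be wired against such a pair: the nameservice name mycluster and the host names are placeholders, while the property keys and the ConfiguredFailoverProxyProvider class are standard HDFS HA configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice for the NameNode pair; "mycluster" and the
        // host names are placeholders for a real deployment.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
        // Lets the client fail over transparently to whichever NameNode is active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // Clients address the logical nameservice, not a single machine.
        conf.set("fs.defaultFS", "hdfs://mycluster");

        // Requires a running HA pair; shown only to illustrate the wiring.
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("Connected via " + fs.getUri());
        }
    }
}
```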
