
SANKHYANA CONSULTANCY SERVICES

Data Driven Decision Science

Chapter 3
Exploring HDFS
What is HDFS?
HDFS (Hadoop Distributed File System) is a file system specially designed for storing huge data sets
on a cluster of commodity hardware with a streaming access pattern.

Introduction to HDFS
 HDFS stores the data on the cluster.
 Files in HDFS are WRITE ONCE and are not suitable for random writes.
 HDFS is optimized for large, streaming reads of files and is not suitable for random reads.
 HDFS is not a regular file system; it is an abstraction layer that sits on top of a native Unix file
system such as ext3, ext4, or xfs.

HDFS Architecture
 HDFS is responsible for storing the data on the Hadoop cluster.
 You can read files from HDFS and write files to HDFS.
 Data files are split into blocks before being stored on the cluster.
 The typical block size is 64 MB (Hadoop 1) or 128 MB (Hadoop 2).
 The default block size in Hadoop 2 is 128 MB.
 You can configure the block size globally in hdfs-site.xml as follows:
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
 Blocks belonging to one file are stored on different DataNodes.
 Each block is replicated on multiple DataNodes to ensure high reliability and fault
tolerance.
 The default replication factor is three, which means that each block exists on three different nodes.
 You can configure the replication factor globally in hdfs-site.xml as follows:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
 Metadata, which includes the file name, the number of blocks, and the DataNodes on which
the blocks are stored, is kept in the NameNode.
 The HDFS architecture is a Master/Slave architecture consisting of a NameNode and
DataNodes.

[Figure: HDFS Master/Slave architecture — a Master node running the NameNode (with a Secondary NameNode alongside) and Slave nodes each running a DataNode.]
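The block-splitting arithmetic above can be sketched in a few lines of Python (a conceptual illustration only — the real splitting is done inside the HDFS client, not by user code):

```python
# Conceptual sketch: how a file is split into 128 MB blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # default block size in Hadoop 2, in bytes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes))   # 3
```

Note that the last block of a file is usually smaller than the block size; HDFS does not pad it out to a full 128 MB.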

SANKHYANA CONSULTANCY SERVICES


Data Driven Decision Science (Training/Consulting/Analytics)
1188, HNR Tower, 4th Floor, 24th Main, Near Parangipalya Bus Stop, Above Udupi Palace, 2nd
Sector, HSR Layout, Bangalore – 560102. Ph: 080 48147185, 48147186

NameNode
 The NameNode is the Master node, responsible for storing the metadata related to
files, the blocks that make up each file, and the locations of those blocks in the cluster.
 The NameNode must be running at all times.
 The NameNode is a single point of failure.
 If the NameNode fails, the cluster becomes inaccessible.
 For this very reason, we have the Secondary NameNode.
Secondary NameNode
 The Secondary NameNode performs periodic checkpoints.
 It performs the following tasks periodically:
-- Downloads the current NameNode image and edits log files.
-- Merges them into a new image.
-- Uploads the new image back to the primary NameNode.
 The Secondary NameNode is not a hot backup of the actual NameNode, because
DataNodes cannot connect to the Secondary NameNode in the case of NameNode failure.
 It is only used to recover the NameNode in the case of NameNode failure.
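The checkpoint merge can be pictured as replaying the edits log on top of the last saved image (a toy sketch; real fsimage and edits files are binary structures, and the dict/list shapes below are made up for illustration):

```python
# Toy sketch of a Secondary NameNode checkpoint: apply the edits log
# to the last downloaded namespace image to produce a new, up-to-date image.
def checkpoint(fsimage, edits_log):
    """fsimage: dict of path -> metadata; edits_log: list of (op, path, meta)."""
    new_image = dict(fsimage)            # start from the downloaded image
    for op, path, meta in edits_log:     # replay each logged operation
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image                     # uploaded back to the primary NameNode

image = {"/hello.txt": {"blocks": 3}}
edits = [("create", "/new.txt", {"blocks": 1}), ("delete", "/hello.txt", None)]
print(checkpoint(image, edits))   # {'/new.txt': {'blocks': 1}}
```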
DataNode
 It is the Slave node that stores blocks of data on its local file system.
 Each DataNode periodically sends a Heartbeat and a Blockreport to the NameNode.
 Receipt of a Heartbeat means that the DataNode is functioning properly.
 A Blockreport contains a list of all the blocks available on a DataNode.

Writing a file to the cluster


1) The user configures the replication factor (default 3) and block size (default 128 MB).
2) The user requests the Hadoop client to write a file to the Hadoop cluster.
Eg: Consider Hello.txt as the file to write.
3) The Hadoop client splits the file into blocks.
Eg: Assume that Hello.txt is divided into 3 blocks, namely B1, B2 and B3.
4) The Hadoop client contacts the NameNode.
5) The NameNode checks and returns the available DataNodes to the Hadoop client.
6) The Hadoop client sends the first block to the first DataNode. After receiving the block, the first
DataNode sends the same block to the next DataNode, and so on, forming the replication
pipeline.
7) The DataNodes send acknowledgements to the NameNode and the Hadoop client after receiving the
block successfully.
8) The Hadoop client repeats the same process for all the other blocks of the file being
written to the cluster.
9) When all the blocks have been written to the cluster, the NameNode stores the metadata
information.
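Steps 6 and 7 — the replication pipeline — can be sketched as a simple forwarding chain (a conceptual illustration only; real HDFS streams the block as packets and sends acknowledgements back along the pipeline):

```python
# Toy sketch of the write pipeline: the client hands a block to the first
# DataNode, which forwards it onward until every replica is stored.
def write_block(block, datanodes):
    """datanodes: list of dicts, each acting as one DataNode's local storage."""
    acks = []
    for node in datanodes:               # client -> DN1 -> DN2 -> DN3
        node[block["id"]] = block["data"]
        acks.append(True)                # each DataNode acknowledges receipt
    return all(acks)                     # True once every replica is written

dn1, dn2, dn3 = {}, {}, {}
ok = write_block({"id": "B1", "data": b"first chunk"}, [dn1, dn2, dn3])
print(ok, "B1" in dn3)   # True True
```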

Reading a file from the cluster


1) The user provides the file name to the Hadoop client.
2) The Hadoop client passes the file name to the NameNode.
3) The NameNode sends the following to the Hadoop client:
a. The number of blocks related to the file.
b. The DataNodes where the blocks are available.


4) After that, the client connects to the DataNodes where the blocks are stored.
5) The Hadoop client downloads the blocks from the nearest DataNodes.
6) Once the client has all the required blocks, it combines them to form the file.
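The read path can be sketched the same way (conceptual only; real HDFS returns block locations sorted by network distance, which is what "nearest DataNode" means above):

```python
# Toy sketch of a read: ask the "NameNode" for the file's block list,
# fetch each block from a DataNode that holds it, and concatenate in order.
def read_file(filename, namenode_meta, datanodes):
    block_ids = namenode_meta[filename]        # e.g. ["B1", "B2", "B3"]
    data = b""
    for bid in block_ids:
        # pick any DataNode holding this block (the nearest one in real HDFS)
        holder = next(dn for dn in datanodes if bid in dn)
        data += holder[bid]
    return data

meta = {"Hello.txt": ["B1", "B2"]}
dns = [{"B1": b"hello "}, {"B2": b"world"}, {"B1": b"hello "}]
print(read_file("Hello.txt", meta, dns))   # b'hello world'
```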

Fault tolerance strategy


1) When the NameNode fails, the Secondary NameNode comes into the picture. The NameNode
is then restored with the help of the merged copy of the NameNode image.
2) Each DataNode sends a heartbeat message to the NameNode every 3 seconds to inform the
NameNode that it is alive. If the NameNode doesn't receive a heartbeat message from a
DataNode for 10 minutes, it considers that DataNode to be dead. It then accesses the replicas of
that DataNode's blocks on the other DataNodes.
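The heartbeat-based failure detection in step 2 can be sketched as a timestamp check (a conceptual illustration; the real NameNode also re-replicates the dead node's blocks):

```python
# Toy sketch of heartbeat-based failure detection on the NameNode.
DEAD_AFTER = 600  # seconds (10 minutes) without a heartbeat

def dead_datanodes(last_heartbeat, now):
    """last_heartbeat: dict of node -> timestamp of its last heartbeat."""
    return [node for node, t in last_heartbeat.items() if now - t > DEAD_AFTER]

# dn2 last reported 610 seconds ago, so it is considered dead.
beats = {"dn1": 1000, "dn2": 400, "dn3": 990}
print(dead_datanodes(beats, now=1010))   # ['dn2']
```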

Replication Strategy
1) The default replication factor is 3.
2) The cluster is organized into racks, where each rack contains DataNodes.
3) The 1st replica is placed on the node where the client is running. If that node is not
available, the 1st replica is placed on another node in the same rack.
4) The 2nd replica is placed on a different rack from the 1st.
5) The 3rd replica is placed in the same rack as the 2nd, but on a different node.
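The rack-aware placement above can be sketched as follows (a simplified illustration assuming the client runs on a DataNode; the real placement policy also weighs free space and load):

```python
# Toy sketch of rack-aware replica placement for replication factor 3.
def place_replicas(client_node, racks):
    """racks: dict of rack name -> list of nodes; client_node is on some rack."""
    client_rack = next(r for r, nodes in racks.items() if client_node in nodes)
    first = client_node                                   # 1st: the client's own node
    other_rack = next(r for r in racks if r != client_rack)
    second = racks[other_rack][0]                         # 2nd: a different rack
    third = next(n for n in racks[other_rack] if n != second)  # 3rd: same rack as 2nd
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))   # ['n1', 'n3', 'n4']
```

This layout survives the loss of a whole rack (two replicas are always on a different rack from the first) while keeping write traffic between racks low (only one cross-rack hop in the pipeline).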

Hadoop Commands
 hadoop fs or hdfs dfs -> Run without arguments to display the commands that work with HDFS.
 hdfs dfs -ls -> Shows the directory listing.
 hdfs dfs -mkdir -> Creates a directory in HDFS. -p: creates parent directories along the
path.
 hdfs dfs -put -> Copies a file or directory from the local filesystem to HDFS.
 hdfs dfs -copyFromLocal -> Does the same work as the put command.
 hdfs dfs -moveFromLocal -> Moves a file or directory from the local filesystem to HDFS.
 hdfs dfs -cat -> Displays the contents of a file in HDFS.
 hdfs dfs -get -> Copies a file from HDFS to the local filesystem.
 hdfs dfs -copyToLocal -> Copies a file or directory from HDFS to the local filesystem.
 hdfs dfs -rm -> Removes files.
 hdfs dfs -rmdir -> Removes an empty directory.
 hdfs dfs -rm -r -> Removes a directory and its contents.
 hdfs dfs -cp -> Copies a file from HDFS to HDFS.
 hdfs dfs -mv -> Moves a file from HDFS to HDFS.
 hdfs dfs -setrep <number> <filename> -> Sets the replication factor for that particular file.
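A typical session tying these commands together might look like this (it assumes a running HDFS cluster; the paths and file names are made up for illustration — adjust them to your own environment):

```shell
# Create a directory tree in HDFS (-p creates parent directories)
hdfs dfs -mkdir -p /user/training/data

# Copy a local file into HDFS, then list the directory and display the file
hdfs dfs -put Hello.txt /user/training/data/
hdfs dfs -ls /user/training/data
hdfs dfs -cat /user/training/data/Hello.txt

# Set the replication factor of the file to 2
hdfs dfs -setrep 2 /user/training/data/Hello.txt

# Copy the file back to the local filesystem, then clean up
hdfs dfs -get /user/training/data/Hello.txt ./Hello_copy.txt
hdfs dfs -rm -r /user/training/data
```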
