Let's learn everything about Hadoop HDFS

What is HDFS?

In Hadoop, the main data is stored in HDFS (Hadoop Distributed File System). HDFS stores big data across multiple nodes and also replicates it, so when one node goes down, the data can still be read from the other nodes. HDFS uses commodity hardware, meaning low-cost hardware, to store the data. HDFS supports a write-once, read-many access pattern.

Some concepts of HDFS

Data blocks:
A data block is the smallest unit of data that HDFS reads or writes at a time. The default block size is 128 MB, and we can change it. If a file (or the last part of a file) is smaller than the block size, it occupies only the space it needs, and the remaining space stays free.
NameNode:
The NameNode is the master or main node of HDFS. We can say it is the controller of HDFS. It doesn't store the file data itself; it stores the metadata of all the files.
DataNode:
The NameNode is the master node and the DataNodes are the worker nodes. The NameNode controls the work of the DataNodes. The data blocks are stored on the DataNodes, and the DataNodes report information about their blocks to the NameNode, so it knows where the data is stored.
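
To make the block arithmetic concrete, here is a minimal sketch that works out how many blocks a file occupies and how large the last one is. It assumes the 128 MB default block size mentioned above; the function name `split_into_blocks` is only illustrative and is not part of Hadoop.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # default block size mentioned in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full 128 MB
        sizes.append(remainder)
    return sizes


sizes = split_into_blocks(300 * MB)
print(len(sizes), [s // MB for s in sizes])  # 3 [128, 128, 44]
```

A 300 MB file therefore needs three blocks, and the third one takes up only 44 MB.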

Advantages of HDFS

1. The cost is low.
2. We can store a huge amount of data.
3. It provides streaming data access.

Disadvantages of HDFS

1. The data access speed is slow.
2. HDFS is useful only for big data; it is not a good fit for small amounts of data.

HDFS Architecture

This is a master-worker architecture. The master is the NameNode and the workers are the DataNodes, which run on low-cost commodity hardware. In HDFS there is a single NameNode and multiple DataNodes. The NameNode contains the metadata, such as replication information and how the data is distributed, and the DataNodes store the data.
In the image we can see a main switch in yellow. This main switch is connected to multiple racks. In each rack there is a local switch in blue, which is connected to the main switch. Each rack contains multiple DataNodes. In the image, we can see that the data of DataNode1 is replicated on DataNode8 and DataNode9. We can also see that DataNode9 holds replicated data of both DataNode1 and DataNode4. Suppose rack2 or DataNode4 is down. In this case we will get the data of DataNode4 from DataNode9.
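
The failover idea from the image can be sketched in a few lines. This is only a toy lookup, not how Hadoop is implemented: the `REPLICAS` map copies the DataNode names from the example above, and `find_source` is a hypothetical helper.

```python
# Replica placement taken from the example in the text (illustrative only)
REPLICAS = {
    "DataNode1": ["DataNode8", "DataNode9"],
    "DataNode4": ["DataNode9"],
}


def find_source(owner, down_nodes):
    """Return a live node that holds the owner's data, or None if all copies are down."""
    for node in [owner] + REPLICAS.get(owner, []):
        if node not in down_nodes:
            return node
    return None


# DataNode4 is down, so its data is served from the replica on DataNode9
print(find_source("DataNode4", down_nodes={"DataNode4"}))  # DataNode9
```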

In each DataNode we have blocks, and the data is stored in those blocks. When a client wants the data, it first asks the NameNode where the blocks are located and then reads the blocks from the DataNodes. In HDFS the data is written once but can be read many times.
Tasks of NameNode:
The NameNode stores the metadata of the files and of the DataNodes. It also manages the file-system namespace, controls the access of different clients to the data blocks, checks the availability of the DataNodes, and manages the replication of data blocks.
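
As a rough illustration of these tasks, the toy class below keeps only metadata (which blocks belong to a file and where each block lives) and reports the blocks that need a new replica when a DataNode becomes unavailable. The class and method names are hypothetical, and the default of 3 replicas is the usual HDFS replication factor, which the text above does not state.

```python
class ToyNameNode:
    """Toy model of the NameNode: it keeps metadata, never the file bytes."""

    def __init__(self, replication=3):
        self.replication = replication
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_block(self, path, block_id, nodes):
        """Record that a block of the given file is stored on the given nodes."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = set(nodes)

    def mark_dead(self, node):
        """Forget a failed DataNode and return the blocks that now lack replicas."""
        under_replicated = []
        for block_id, nodes in self.block_locations.items():
            nodes.discard(node)
            if len(nodes) < self.replication:
                under_replicated.append(block_id)
        return sorted(under_replicated)


nn = ToyNameNode(replication=2)
nn.add_block("/logs/a.txt", "blk_1", ["dn1", "dn2"])
nn.add_block("/logs/a.txt", "blk_2", ["dn2", "dn3"])
print(nn.mark_dead("dn1"))  # ['blk_1']
```

After `dn1` fails, only `blk_1` has fewer than two copies left, so it is the block that would have to be copied to another DataNode.
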
Tasks of DataNode:
DataNodes are low-cost hardware where the data is stored, so we can say that DataNodes are the main storage. DataNodes perform operations such as storing blocks, creating replicas, and deleting blocks, according to the instructions of the NameNode.
Secondary NameNode:
In HDFS there is only one NameNode and multiple DataNodes, but there can also be a Secondary NameNode. The Secondary NameNode doesn't replace the NameNode; it supports the primary or main NameNode. We use this node to take checkpoints of the file system.
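
A checkpoint can be pictured as merging a saved snapshot of the namespace with the changes logged since that snapshot. The sketch below is a heavily simplified model under that assumption: the snapshot is just a set of paths, the log is a list of `(operation, path)` pairs, and the `checkpoint` function is illustrative rather than any Hadoop API.

```python
def checkpoint(fsimage, edit_log):
    """Apply the logged edits to a copy of the namespace image and return the new image."""
    new_image = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            new_image.add(path)
        elif op == "delete":
            new_image.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


fsimage = {"/data/a.csv"}
edits = [("create", "/data/b.csv"), ("delete", "/data/a.csv")]
print(sorted(checkpoint(fsimage, edits)))  # ['/data/b.csv']
```

Once the merged image is saved, the old log entries are no longer needed, which keeps the main NameNode's restart time short.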

HDFS Reading

To read data from HDFS, the client first sends a request to the NameNode for the metadata. The NameNode responds with the number of blocks in which the requested data is stored, the locations of those blocks, their replicas, and some other information. This information is needed because a file is spread over many blocks: the client has to know which blocks hold the data and where those blocks are located. Keep in mind that each block also has several replicas, so the client needs that information as well. After this, the client communicates with those DataNodes and reads the blocks in parallel. When all the data has been read, the blocks are combined to form the original file.
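
The read flow described above (get the block locations, fetch the blocks in parallel, join them in order) can be sketched with in-memory stand-ins. Everything here is hypothetical: `DATANODES` plays the role of the cluster, `BLOCK_LOCATIONS` is what the NameNode would return, and no real HDFS client is involved.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory DataNodes: node name -> {block id: bytes}
DATANODES = {
    "dn1": {"blk_0": b"Hello, "},
    "dn2": {"blk_1": b"HDFS "},
    "dn3": {"blk_2": b"reader!"},
}

# What the NameNode would return: (block id, node) pairs in file order
BLOCK_LOCATIONS = [("blk_0", "dn1"), ("blk_1", "dn2"), ("blk_2", "dn3")]


def read_block(location):
    """Fetch one block from the DataNode that holds it."""
    block_id, node = location
    return DATANODES[node][block_id]


def read_file(block_locations):
    """Fetch all blocks in parallel, then join them in file order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() returns results in input order, so the blocks stay in file order
        parts = list(pool.map(read_block, block_locations))
    return b"".join(parts)


print(read_file(BLOCK_LOCATIONS))  # b'Hello, HDFS reader!'
```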

Let's see an image for better understanding:

In the image, follow the numbers 1-6. First, the client sends an open request to the Distributed File System. The request then goes to the NameNode to get the locations of the blocks. After getting the locations, the client sends a read request to the FS Data Input Stream. The FS Data Input Stream, which now has the block locations, starts reading the data from the DataNodes. After reading, the process is terminated. To get the final output file, all the blocks that were read are combined.

HDFS Writing

To write to HDFS, the client first sends a request for metadata to the NameNode. The NameNode responds with the number of blocks, the block locations, and some other information. After getting this information, the client splits the data into blocks and starts sending them to the DataNodes. When a DataNode receives a block, it creates a replica by passing the data on to another DataNode, which stores it too.
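
The write flow can be sketched the same way: split the data into blocks, then store each block on more than one DataNode. The round-robin placement used here is a deliberate simplification and not Hadoop's real placement policy; `write_file`, the node names, and the tiny 4-byte block size are all made up for the example.

```python
def write_file(data, block_size, datanodes, replication=2):
    """Split data into blocks and store each block on `replication` DataNodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    names = sorted(datanodes)
    block_map = {}
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id = f"blk_{index}"
        chunk = data[start:start + block_size]
        # Simplified round-robin placement: the first target receives the block
        # and the following targets receive the replicas
        targets = [names[(index + r) % len(names)] for r in range(replication)]
        for node in targets:
            datanodes[node][block_id] = chunk
        block_map[block_id] = targets
    return block_map


datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}
block_map = write_file(b"abcdefghij", block_size=4, datanodes=datanodes)
print(block_map)
# {'blk_0': ['dn1', 'dn2'], 'blk_1': ['dn2', 'dn3'], 'blk_2': ['dn3', 'dn1']}
```

The returned `block_map` corresponds to the metadata the NameNode keeps, while the `datanodes` dictionaries hold the actual bytes.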

Let's see the process in the form of image.

In the image, follow the steps in numbered order. In the first step, the client sends a create request to the Distributed File System. This create request then goes to the NameNode. After getting the request, the NameNode collects all the metadata and sends it to the Distributed File System. After this, the client divides the data into chunks and sends a write request to the FS Data Output Stream. Now the writing part goes to the DataNodes. The DataNodes do their work and then pass an ack packet as output to the FS Data Output Stream. Then the client sends a close request to the FS Data Output Stream. After this, the Distributed File System sends a complete request to the NameNode.

HDFS Commands

Start Hadoop first so that you can run HDFS commands.

Command: Description and syntax
cat: Reads a file in HDFS and displays the content of the file on the console.
    hadoop fs -cat /path_to_file_in_hdfs
mkdir: Creates a new directory inside the given location.
    hdfs dfs -mkdir <path>/dir_name
ls: Shows all the files and folders in the given path, along with their permissions, group, modification time, file size, etc.
    hdfs dfs -ls <path>
rm: Removes a file or directory. If the directory contains elements, use the -r option to delete all of them as well.
    hdfs dfs -rm <file_path>
    hdfs dfs -rm -r <directory_name>
put: Stores data from the local disk into HDFS. This command takes multiple arguments; all of them are sources except the last one, which is the destination path in HDFS.
    hdfs dfs -put <path1> <path2> <path3> <destination>
count: Counts the number of directories and the number of files inside a given directory and displays the size of its content.
    hdfs dfs -count <path>
cp: Copies a file from one directory to another directory in HDFS.
    hadoop fs -cp <source> <destination>
copyFromLocal: Copies files from our local system to HDFS. This command has an optional switch -f: if the file already exists at the destination, -f replaces the existing file.
    hadoop fs -copyFromLocal <local_source> <hdfs_destination>
copyToLocal: Copies files from HDFS to our local system.
    hadoop fs -copyToLocal <hdfs_source> <local_destination>
get: Stores data from HDFS on the local disk. This command takes multiple arguments; all of them are sources except the last one, which is the destination path on the local disk.
    hdfs dfs -get <path1> <path2> <path3> <destination>
mv: Moves files or directories from a source to a destination in HDFS.
    hadoop fs -mv <src_or_path> <destination>
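
If you want to drive these commands from a script, one possible approach is to build the argument list in Python and hand it to `subprocess`. The sketch below only assembles and prints the command lines, so it runs without Hadoop installed. The helper `hadoop_fs`, the directory `/user/demo/input`, and the file `sales.csv` are made-up examples.

```python
import shlex


def hadoop_fs(command, *args):
    """Build the argument list for a `hadoop fs` shell command."""
    return ["hadoop", "fs", f"-{command}", *args]


commands = [
    hadoop_fs("mkdir", "/user/demo/input"),
    hadoop_fs("put", "sales.csv", "/user/demo/input"),
    hadoop_fs("ls", "/user/demo/input"),
    hadoop_fs("cat", "/user/demo/input/sales.csv"),
]

for argv in commands:
    # Print the command line exactly as it would be typed in a terminal
    print(shlex.join(argv))
```

On a machine where Hadoop is running, each list could be passed to `subprocess.run(argv, check=True)` to execute it.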

© Copyright All rights reserved www.CodersAim.com. Developed by CodersAim.