Hadoop HDFS
In Hadoop, the main data is stored in HDFS. In HDFS, data is stored across multiple nodes and is also replicated. This means that we store big data in HDFS across multiple nodes and replicate it, so when one node goes down we can still access the data from the other nodes. HDFS uses commodity hardware to process and store the data; commodity hardware means low-cost hardware. HDFS supports the write-once, read-many pattern.
Data blocks:
A data block is the minimum unit of data that HDFS reads or writes at a time. The default block size is 128 MB, and the block size can be changed. If the data is smaller than the block size, the block occupies only the actual data size and the remaining space stays free. For example, a 300 MB file is stored as two 128 MB blocks plus one 44 MB block.
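To make the block arithmetic concrete, here is a minimal sketch using the Hadoop FileSystem Java client API (the file path /data/input.csv is a hypothetical example, and a configured Hadoop client is assumed) that asks HDFS for the block size of a file and estimates how many blocks it occupies:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // connects to the configured file system

        Path file = new Path("/data/input.csv");         // hypothetical HDFS file
        FileStatus status = fs.getFileStatus(file);

        long blockSize = status.getBlockSize();          // block size used for this file (default 128 MB)
        long length = status.getLen();                   // total file length in bytes
        long fullBlocks = length / blockSize;
        long lastBlock = length % blockSize;             // the last block holds only the remainder

        System.out.println("Block size : " + blockSize);
        System.out.println("Full blocks: " + fullBlocks);
        System.out.println("Last block : " + lastBlock + " bytes (remaining space is not occupied)");
    }
}
```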
NameNode:
The NameNode is the master (main) node of HDFS. It contains the metadata; we can say it is the controller of HDFS. It does not store the actual data, it stores the metadata of all the files.
DataNode:
The NameNode is the master node and the DataNodes are the worker nodes; the NameNode controls the work of the DataNodes. The DataNodes hold the actual data blocks, and they regularly report to the NameNode which blocks they are storing.
Advantages of HDFS:
1. The cost is low.
2. We can store a huge amount of data.
3. It provides streaming data access.
Limitations of HDFS:
1. Data access is slow (high latency).
2. HDFS is useful only for big files; it is not efficient for a large number of small files.
This is a master-worker architecture. Here the master is the NameNode and the workers are low-cost commodity machines, that is, the DataNodes. In HDFS there is a single NameNode and multiple DataNodes. The NameNode contains the metadata, such as the replication information and how the data is distributed, and the DataNodes store the data.
In the image we can see a main switch in yellow. This switch is connected to multiple racks, and in each rack there is a local switch in blue that connects to the main switch. In each rack there are multiple DataNodes. In the image, the data of DataNode1 is replicated on DataNode8 and DataNode9, and DataNode9 holds replicated data of both DataNode1 and DataNode4. Suppose rack2 or DataNode4 goes down; in that case we can still get the data of DataNode4 from DataNode9.
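As a rough sketch of how this replication metadata can be inspected from a client, the Hadoop FileSystem Java API exposes the replication factor and the DataNodes that hold each block (the file path and replication value below are hypothetical examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/input.csv");          // hypothetical HDFS file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // Ask the NameNode which DataNodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " is stored on: " + String.join(", ", block.getHosts()));
        }

        // The replication factor of an existing file can also be changed
        fs.setReplication(file, (short) 3);
    }
}
```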
The DataNodes contain blocks, so the data is stored in those blocks that are present on the DataNodes. When a client wants to get this data, it first goes to the NameNode, and then the data is delivered to the client from the DataNodes. In HDFS the data is written once but can be read many times.
Tasks of NameNode:
The NameNode stores the metadata about the data kept on the DataNodes. It manages and controls the file-system namespace, controls the access of different clients to different data blocks, checks the availability of the DataNodes, and manages the replication of data blocks.
Tasks of DataNode:
DataNodes are low-cost machines where the data is actually stored, so we can say that the DataNodes are the main storage. DataNodes perform operations such as storing the data, creating replicas of blocks, deleting blocks, etc. according to the commands or instructions of the NameNode.
Secondary NameNode:
In HDFS there is only one NameNode and multiple DataNodes, but there can also be a secondary NameNode. The secondary NameNode does not replace the NameNode; it supports the primary (main) NameNode. We use this node to take checkpoints of the file-system metadata.
In HDFS we have the data. To read it, the client first sends a request to the NameNode for the metadata. The NameNode responds with the number of blocks in which the requested data is stored, the locations of those blocks, the replicas, and some other information. This information is needed because there are a lot of blocks: we need to know which blocks hold the data and where those blocks are located in order to fetch them. Keep in mind that there are also many replicas of the required blocks, so the replica information is needed as well. After this, the client reads from those DataNodes in parallel. When all the data has been read, the blocks are combined to reconstruct the original file.
Let's see an image for better understanding:
In the image, follow the numbers 1-6. First, the client sends an open request to the distributed file system. The request goes to the NameNode to get the locations of the blocks. After getting the locations, the client sends a read request to the FS data input stream. The FS data input stream uses those block locations and starts reading the data. After reading, the process is terminated, and all the blocks that were read are combined to form the final output data file.
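A minimal Java sketch of this read path (assuming a configured Hadoop client; the file path /data/input.csv is a hypothetical example): opening a file returns the FSDataInputStream mentioned above, and the NameNode lookup and parallel block reads happen behind that call.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);     // DistributedFileSystem when fs.defaultFS points to hdfs://

        Path file = new Path("/data/input.csv");  // hypothetical HDFS file
        // open() asks the NameNode for the block locations, then streams the blocks from the DataNodes
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```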
To write in HDFS, the client first sends a request for metadata to the NameNode. The NameNode sends the number of blocks, the block locations, and some other information as a response. After getting this, the client splits the data into different blocks and then starts sending the data to the DataNodes. When a DataNode receives a block, it creates a replica and stores the data on another DataNode.
Let's see the process in the form of an image.
In the image, follow the steps according to the numbers. In the first step, the client sends a create request to the distributed file system. This create request goes to the NameNode. After getting the request, the NameNode collects all the metadata and sends it to the distributed file system. The client then divides the data into chunks and sends write requests to the FS data output stream. The writing now goes to the DataNodes; the DataNodes do their work and then pass an ack packet as output back to the FS data output stream. Then the client sends a close request to the FS data output stream, and finally the distributed file system sends a complete request to the NameNode.
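A corresponding Java sketch of the write path (the output path /data/output.txt is a hypothetical example): create() returns the FSDataOutputStream described above, and the splitting into blocks, DataNode placement, and replication are handled by HDFS underneath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/data/output.txt");               // hypothetical HDFS destination
        // create() contacts the NameNode; the data written to the stream is split into blocks,
        // sent to the DataNodes, and replicated according to the configured replication factor
        try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite if it exists
            out.write("Hello, HDFS! Written once, read many times.\n".getBytes(StandardCharsets.UTF_8));
        }

        System.out.println("Write finished: " + fs.getFileStatus(file).getLen() + " bytes");
    }
}
```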
Start Hadoop first in order to run HDFS commands.
Command | Description |
---|---|
cat | Reads a file in HDFS and displays its content on the console. Syntax: hadoop fs -cat /path_to_file_in_hdfs |
mkdir | Creates a new directory inside the given location. Syntax: hdfs dfs -mkdir <path>/dir_name |
ls | Shows all the files and folders in a path along with the file permissions, group, modification time, file size, etc. Syntax: hdfs dfs -ls <path> |
rm | Removes a file or directory. If the directory contains elements, use the -r option to delete it together with all of its contents. Syntax: hdfs dfs -rm <file_path> or hdfs dfs -rm -r <directory_name> |
put | Copies data from the local disk into HDFS. This command takes multiple arguments; all of them are sources except the last one, which is the destination path in HDFS. Syntax: hdfs dfs -put <path1> <path2> <path3> <destination> |
count | Counts the number of directories and files inside a given directory and displays the size of its content. Syntax: hdfs dfs -count <path> |
cp | Copies a file from one directory to another within HDFS. Syntax: hadoop fs -cp <source> <destination> |
copyFromLocal | Copies files from the local system to HDFS. It has an optional -f switch: if the file already exists in HDFS, -f replaces (overwrites) it. Syntax: hadoop fs -copyFromLocal <local_source> <hdfs_destination> |
copyToLocal | Copies files from HDFS to the local system. Syntax: hadoop fs -copyToLocal <hdfs_source> <local_destination> |
get | Copies data from HDFS to the local disk. This command takes multiple arguments; all of them are sources except the last one, which is the destination path on the local disk. Syntax: hdfs dfs -get <path1> <path2> <path3> <destination> |
mv | Moves files or directories from a source to a destination within HDFS. Syntax: hadoop fs -mv <source> <destination> |
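Most of these shell commands also have equivalents in the Hadoop FileSystem Java API. Here is a small illustrative sketch (all paths are hypothetical) showing a few of them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // mkdir: create a directory
        fs.mkdirs(new Path("/data/new_dir"));

        // put / copyFromLocal: copy a local file into HDFS
        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/data/new_dir/local.txt"));

        // get / copyToLocal: copy an HDFS file back to the local disk
        fs.copyToLocalFile(new Path("/data/new_dir/local.txt"), new Path("/tmp/copy.txt"));

        // mv: rename or move within HDFS
        fs.rename(new Path("/data/new_dir/local.txt"), new Path("/data/new_dir/moved.txt"));

        // rm -r: delete a directory recursively
        fs.delete(new Path("/data/new_dir"), true);
    }
}
```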