
Let's learn everything about MapReduce in Hadoop.



What is MapReduce?

In the image we can see the input, then the map tasks, then the reduce tasks, and finally the output. In the map tasks the map method runs, in the reduce tasks the reduce method runs, and the output is the aggregated result. MapReduce is used to process large amounts of data. If we try to process a huge amount of data at once, the processing becomes slow and can raise errors. With MapReduce, the data is not processed all at once: it is divided into smaller pieces, those pieces are assigned to worker nodes, and the resulting tasks are executed in parallel for faster processing. In other words, the whole job is taken from the user, divided into smaller tasks, and those smaller tasks are assigned to the worker nodes. A MapReduce program takes its input as a list and also gives its output as a list.
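
To make the idea concrete, here is a minimal sketch in plain Python (not Hadoop itself) that imitates the flow described above: the input list is split into smaller pieces, each piece is mapped in parallel by worker processes, and the partial results are reduced into one aggregated output. The word-count logic and the names map_piece and reduce_counts are illustrative assumptions, not part of Hadoop's API.

# A minimal word-count sketch of the MapReduce idea in plain Python.
# It only imitates the flow (split -> map in parallel -> reduce);
# real Hadoop distributes the pieces across cluster nodes.
from collections import Counter
from multiprocessing import Pool

def map_piece(lines):
    # Map step: turn one piece of the input into per-word counts.
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

def reduce_counts(partial_counts):
    # Reduce step: aggregate the partial counts into one final result.
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    data = ["big data is big", "map reduce handles big data", "data data data"]
    pieces = [[line] for line in data]         # split the input into smaller pieces
    with Pool(processes=2) as pool:            # worker processes play the role of nodes
        partial = pool.map(map_piece, pieces)  # map tasks run in parallel
    print(reduce_counts(partial))              # aggregated output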

The Map task

In the image we can see the map task section, where the map method is used.

Let's discuss the map task:
The input is in the form of key:value pairs. It doesn't matter what kind of data comes in; the mapper turns it into key:value pairs. Here we can think of the key as a reference to the input file and the value as the dataset. You write your own business logic according to your data-processing needs. In other words, you decide how the work will be performed; there is no predefined logic, so you have to write your own.
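
As an illustration, here is a sketch of a mapper in the style of Hadoop Streaming, where the mapper is an ordinary script that reads raw input lines from standard input and writes tab-separated key/value pairs to standard output. The word-count logic stands in for "your own business logic", and the file name mapper.py is an assumption for the example.

# mapper.py - a Hadoop Streaming style mapper sketch (word count example).
# Reads raw input lines from stdin and emits key<TAB>value pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # The word becomes the key; 1 is the value for one occurrence.
        print(f"{word}\t1")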

The reduce task

In the image we can see the reduce task section, where the reduce method is used.

Let's discuss the reduce task:
First, the reducer takes as input the key:value pairs created by the mapper in the map task. The reducer then performs operations such as summation, aggregation, sorting, and so on.
So we can say that the reducer takes its input from the mapper (the map task). The mapper generates intermediate data according to the logic given by the user, and the reducer takes this intermediate data as input. The intermediate data is then processed according to the user-defined function. After this we get the final output, and this output is stored in HDFS.
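
Continuing the same Hadoop Streaming style sketch, the reducer below reads the mapper's intermediate key<TAB>value lines from standard input (already sorted by key by the shuffle step) and sums the values for each key. The file name reducer.py and the word-count logic are again assumptions for the example.

# reducer.py - a Hadoop Streaming style reducer sketch (word count example).
# Input lines arrive sorted by key, so all values for one key are adjacent.
import sys

current_key = None
current_sum = 0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        current_sum += int(value)
    else:
        if current_key is not None:
            print(f"{current_key}\t{current_sum}")  # emit the finished key
        current_key = key
        current_sum = int(value)

if current_key is not None:
    print(f"{current_key}\t{current_sum}")  # emit the last key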

Let's look at a complete diagram:




What is Shuffling and Sorting in MapReduce?

The mapper's intermediate output, the key:value pairs, is the input of the reduce task. In the shuffling process the system sorts this data by key. Shuffling does not wait until the whole mapping process is done; it starts as soon as some of the map tasks have finished.
We know that the reduce task starts after the shuffle task. In shuffling, the data is sorted according to the key, and these sorted key:value pairs become the input of the reduce task.
Why do we need to sort the key:value pairs?
We need to sort the key:value pairs so that the reducer can tell where one key's group ends and a new one begins, that is, when a new reduce call should start. If there is no reduce task, then there is no need for shuffling and sorting.
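
To see why sorting matters, here is a small sketch in plain Python that imitates the shuffle-and-sort step: the mapper's intermediate pairs are sorted by key and then grouped, and each group would become one reduce call. The example pairs are made up for illustration.

# A plain-Python imitation of shuffle & sort: group intermediate pairs by key.
from itertools import groupby
from operator import itemgetter

intermediate = [("data", 1), ("big", 1), ("data", 1), ("map", 1), ("big", 1)]

# Shuffle & sort: order the pairs by key so equal keys sit next to each other.
intermediate.sort(key=itemgetter(0))

# Each key's group is handed to one reduce call.
for key, group in groupby(intermediate, key=itemgetter(0)):
    print(key, sum(value for _, value in group))   # big 2, data 2, map 1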

Let's see another diagram of the MapReduce working process:




Job tracker or Resource manager in MapReduce

In Hadoop there are different clients, and these clients send their jobs to Hadoop to be performed. The jobs of these clients are different, and they also need to be completed in some order. Suppose there are multiple jobs. In which order the jobs should be executed, which job should run right now, which should run later and when, all of these decisions are made by the job tracker (called the resource manager in YARN).

There are mainly three different job schedulers in MapReduce:
1. First In, First Out (FIFO)
2. Capacity Scheduler
3. Fair Scheduler
The default scheduler is FIFO.

How does the FIFO scheduler work?




In the image we can see a priority queue. The different colored squares are different jobs from different clients. According to priority, a job goes to the job tracker, the job tracker performs the scheduling (in this case FIFO), and the job then goes to a node and to a slot inside that node, where the job is executed. There are multiple nodes and multiple slots.
One disadvantage is:
Suppose one job has already been taken, and then a new job arrives whose priority is higher than the running job's. The new job still has to wait until the taken job is completed. So a higher-priority job can end up waiting a long time, even though its priority is higher than the job that was already taken.
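
This waiting problem can be seen in a tiny simulation: jobs run strictly in arrival order from a FIFO queue, so a high-priority job that arrives late still waits behind everything submitted before it. The job names and the print output are made up for illustration; this is not Hadoop code.

# A toy FIFO scheduler: jobs run strictly in arrival order,
# so a late high-priority job still waits for earlier jobs.
from collections import deque

queue = deque()
queue.append(("job-A", "low priority"))
queue.append(("job-B", "low priority"))
queue.append(("job-C", "HIGH priority, arrived last"))   # still served last

while queue:
    name, priority = queue.popleft()    # FIFO: always take the oldest job first
    print(f"running {name} ({priority})")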

How does the Capacity Scheduler work?



Here the scheduler has multiple queues to schedule the tasks. In the image we can see organization A and organization B; these are the queues. In the jobs-under-process section, the red square is a job from organization A's queue and the sky-blue square is a job from organization B's queue. These two go to the nodes through the job tracker. Because the two go into the nodes together, some slots are used for the first queue (the red square) and some for the second queue (the sky-blue square). So we can say that slots are dedicated to queues. If there is no job for organization B, then organization A can use the slots that are dedicated to organization B. But when a new job arrives for organization B after some time, organization A, which has been using its own slots and also B's, releases organization B's slots; in this way organization B takes its slots back and starts processing.
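
The slot borrowing described above can be sketched in a very simplified way: each organization (queue) has dedicated slots, an idle organization's slots may be borrowed, and they are handed back when that organization gets a job. This is only a toy model of the idea, not the actual Capacity Scheduler implementation.

# Toy model of capacity scheduling: each queue has dedicated slots,
# idle slots can be borrowed and are returned when the owner needs them.
dedicated = {"orgA": 4, "orgB": 4}      # slots dedicated to each queue
in_use = {"orgA": 4, "orgB": 0}         # organization B has no job right now

# Organization A borrows organization B's idle slots.
in_use["orgA"] += dedicated["orgB"]
print("orgA running on", in_use["orgA"], "slots (4 of its own + 4 borrowed)")

# A new job arrives for organization B, so A releases the borrowed slots.
in_use["orgA"] -= dedicated["orgB"]
in_use["orgB"] = dedicated["orgB"]
print("orgA:", in_use["orgA"], "slots, orgB:", in_use["orgB"], "slots")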

How does the Fair Scheduler work?



This scheduler is similar to the Capacity Scheduler.
In the image we can see three jobs in the jobs-under-process section, and we can also see that some slots are dedicated to pool A and some slots are dedicated to pool B. In the image, the red boxes belong to pool A and the blue boxes belong to pool B. When a higher-priority job arrives, some of the dedicated slots are given to that job, because its priority is higher than the other jobs and a higher-priority job cannot wait. In the image we can see that the red and sky-blue jobs are running in their dedicated slots, but when the cyan higher-priority job arrives, it uses some of pool A's dedicated slots.
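
In the same toy style, the key difference from FIFO is that a newly arrived high-priority job does not wait: it immediately takes over some of a pool's slots instead of waiting for the running jobs to finish. Again, this is only an illustrative model, not the real scheduler.

# Toy model of fair scheduling: a high-priority job preempts some slots
# from pool A instead of waiting for the running jobs to finish.
pool_a_slots = ["red-job"] * 4       # pool A's dedicated slots
pool_b_slots = ["blue-job"] * 4      # pool B's dedicated slots

def preempt(pool, new_job, n):
    # Hand the first n of the pool's slots to the new high-priority job.
    for i in range(n):
        pool[i] = new_job
    return pool

pool_a_slots = preempt(pool_a_slots, "cyan-high-priority-job", 2)
print(pool_a_slots)   # two slots now run the cyan job, two still run the red job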

MapReduce Algorithm



In the map task three things happen:
1. Tokenizing the input
2. Mapping
3. Shuffle & sort
In the reduce task two things happen:
1. Searching
2. Reducing
We can say that the input first goes to the mapper, where those three steps happen; the output of the mapper then goes to the reducer, where those two steps happen; and at the end we get the final output. Here both the input and the output are in the form of key:value pairs.
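
Putting the named steps together, here is a plain-Python sketch of the whole algorithm: tokenizing the input, mapping, shuffle & sort, and then the reduce side, which walks (searches) each sorted key group and reduces it. The sample text and function choices are assumptions made for the example.

# End-to-end plain-Python sketch of the MapReduce algorithm steps.
from itertools import groupby
from operator import itemgetter

text = ["deer bear river", "car car river", "deer car bear"]

# 1. Tokenizing the input.
tokens = [word for line in text for word in line.split()]

# 2. Mapping: emit a (key, value) pair for every token.
mapped = [(word, 1) for word in tokens]

# 3. Shuffle & sort: order the pairs by key so equal keys are adjacent.
mapped.sort(key=itemgetter(0))

# 4./5. Searching and reducing: walk each key group and aggregate it.
output = {key: sum(value for _, value in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(output)   # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}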
