Why MapReduce?

When a client submits a MapReduce job, the master sub-divides it into equal sub-parts. These job-parts are used for the two main tasks in MapReduce: mapping and reducing. The developer writes logic that satisfies the requirements of the organization or company. The input data is split and mapped, and the intermediate data is then sorted and merged. The resulting output is processed by the reducer, which generates a final output that is stored in HDFS.

Every job consists of two key components: a mapping task and a reducing task. The map task splits the job into job-parts and maps the intermediate data. The reduce task shuffles and reduces the intermediate data into smaller units.

The job tracker acts as a master: it ensures that all submitted jobs are executed. The job tracker schedules jobs submitted by clients and assigns them to task trackers. Each task tracker runs map tasks and reduce tasks, and reports the status of each assigned job back to the job tracker. The MapReduce program itself is executed in three main phases: mapping, shuffling, and reducing.

There is also an optional phase known as the combiner phase.

Mapping is the first phase of the program. There are two steps in this phase: splitting and mapping. In the splitting step, a dataset is split into equal-sized units called input splits. Hadoop provides a RecordReader that uses TextInputFormat to transform the input splits into key-value pairs.

These key-value pairs are then used as inputs in the mapping step; they are the only data format a mapper can read or understand. The mapping step contains the coding logic that is applied to these data blocks: the mapper processes each key-value pair and produces output in the same key-value form.
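As a concrete sketch of the mapping step, here is what a word-count mapper could look like with Hadoop's Java MapReduce API; the class name WordCountMapper is chosen for illustration and is not from the original text. TextInputFormat hands the mapper the byte offset of each line as the key and the line itself as the value.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: emits (word, 1) for every token in a line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is one line of the split.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // the output is again a key-value pair
        }
    }
}

Note that the mapper's output types (Text, IntWritable) become the input types of the reducer in the later phases.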

Shuffling is the second phase; it takes place after the mapping phase completes and consists of two main steps: sorting and merging. In the sorting step, the key-value pairs are sorted by their keys. In the merging step, pairs that share a key are combined: the different values belonging to the same key are grouped together and duplicate entries are removed. The output of this phase is again a set of keys and values, just like in the mapping phase. In the reducer phase, the output of the shuffling phase is used as the input.

The reducer processes this input further, reducing the intermediate values into a smaller set of values and providing a summary of the entire dataset.
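Continuing the same word-count sketch, a matching reducer could sum the grouped counts; again, WordCountReducer is an illustrative name rather than anything mandated by the framework.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count reducer: all values sharing a key arrive together
// after shuffling, so summing them yields the word's total frequency.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total); // one summarized record per key
    }
}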

Viewed end to end, the whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing. The input to a MapReduce job is divided into fixed-size pieces called input splits; an input split is the chunk of the input consumed by a single map task, and splitting is the very first phase in the execution of a MapReduce program. In the mapping phase, the data in each split is passed to a mapping function to produce output values. The shuffling phase consumes the output of the mapping phase and consolidates the relevant records: in our word-count example, the same words are clubbed together along with their respective frequencies. In the reducing phase, the output values from the shuffling phase are aggregated; this phase combines the values from the shuffling phase and returns a single output value.
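To make the four phases concrete, here is a small made-up trace of a word count over two lines of input (the text itself is invented for illustration):

Input:      "deer bear river"                "car car river"
Splitting:  each line becomes an input split handled by one map task
Mapping:    (deer,1) (bear,1) (river,1)      (car,1) (car,1) (river,1)
Shuffling:  (bear,[1])  (car,[1,1])  (deer,[1])  (river,[1,1])
Reducing:   (bear,1)  (car,2)  (deer,1)  (river,2)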

In short, the reducing phase summarizes the complete dataset. In our example, it aggregates the values from the shuffling phase, i.e., it calculates the total occurrences of each word. Hadoop divides the job into tasks. There are two types of tasks: map tasks (splitting and mapping) and reduce tasks (shuffling and reducing). The complete execution of map and reduce tasks is controlled by two types of entities: a job tracker and multiple task trackers. For every job submitted for execution in the system, there is one job tracker that resides on the NameNode, and there are multiple task trackers that reside on DataNodes.

The Hadoop framework decides how many mappers to use, based on the size of the data to be processed and the memory block available on each mapper server. After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers.

A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.
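To tie the pieces together, a driver class configures the job, registers the mapper and reducer, and submits the job to the framework. The sketch below assumes the WordCountMapper and WordCountReducer classes from the earlier examples and takes the input and output paths as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key and value types of the final output written to HDFS.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}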

Combining is an optional process. The combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper further to a simplified form before passing it downstream, which makes shuffling and sorting easier because there is less data to work with. Often, the combiner class is set to the reducer class itself, because the reduce function is commutative and associative.
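For the word-count sketch above, where summing counts is both commutative and associative, enabling the combiner is a single extra line in the driver:

// In WordCountDriver, after registering the mapper and reducer:
job.setCombinerClass(WordCountReducer.class);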

However, if needed, the combiner can be a separate class as well.

Partitioning decides how the mapper output is presented to the reducers: it assigns each key to a particular reducer.

The default partitioner computes a hash value for each key coming out of the mapper and assigns a partition based on this hash value. There are as many partitions as there are reducers, so once partitioning is complete, the data from each partition is sent to a specific reducer.
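The sketch below mirrors what the default hash partitioning does for the word-count job; the class name WordPartitioner is illustrative. In practice, a custom partitioner like this is only needed when the default key hashing does not distribute the load the way you want.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash the key and map it onto one of the configured reduce partitions.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class), alongside job.setNumReduceTasks(n) to fix the number of partitions.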

Consider an ecommerce system that receives a million requests every day to process payments. There may be several exceptions thrown during these requests, such as "payment declined by a payment gateway," "out of inventory," and "invalid address."

The objective is to isolate the use cases that are most prone to errors and to take appropriate action. For example, if the same payment gateway is frequently throwing an exception, is it because of an unreliable service or a badly written interface? If the "out of inventory" exception is thrown often, does it mean the inventory calculation service has to be improved, or do the inventory stocks need to be increased for certain products?

The developer can ask relevant questions and determine the right course of action. To perform this analysis on logs that are bulky, with millions of records, MapReduce is an apt programming model. Multiple mappers can process these logs simultaneously: one mapper could process a day's log or a subset of it based on the log size and the memory block available for processing in the mapper server.

For simplification, let's assume that the Hadoop framework runs just four mappers: Mapper 1, Mapper 2, Mapper 3, and Mapper 4. The value input to each mapper is one record of the log file.
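A mapper for this use case could look like the sketch below. The tab-separated log format and the ExceptionCountMapper class name are assumptions made for illustration; the real layout would depend on the ecommerce system. Each mapper emits (exception message, 1), and a reducer with the same shape as the word-count reducer would sum the counts per exception.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed log record layout: "<timestamp>\t<orderId>\t<exceptionMessage>".
public class ExceptionCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text exception = new Text();

    @Override
    protected void map(LongWritable offset, Text logLine, Context context)
            throws IOException, InterruptedException {
        String[] fields = logLine.toString().split("\t");
        if (fields.length == 3 && !fields[2].isEmpty()) {
            exception.set(fields[2]);      // e.g. "payment declined by a payment gateway"
            context.write(exception, ONE); // emit (exception message, 1)
        }
    }
}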


