An Improved Shuffling Approach Towards Skew Mitigation in Mapreduce
N. K. Seera, S. Taruna
Section: Research Paper, Product Type: Journal Paper
Volume-6, Issue-7, Page no. 819-826, Jul-2018
Online published on Jul 31, 2018
Copyright © N. K. Seera, S. Taruna. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In MapReduce applications, map tasks are generally launched in parallel and assigned equal-sized input splits, so map-side skew is rare. Reduce-side skew is much more challenging, because the shuffling of the intermediate data, the partition sizes, and the assignment of partitions to worker nodes cannot be determined at early stages. It is therefore one of the critical problems in the MapReduce model: it should be studied thoroughly, and possible solutions need to be framed. This paper studies various causes of skew and the common approaches used for skew mitigation in real-world applications. It presents a novel approach to reduce-side skew in which the large volume of intermediate data is preprocessed by intermediate nodes to shrink the data associated with each intermediate key. The partial results from the intermediate nodes are collected, aggregated, and sent to the final worker nodes, which generate the final output. The proposed model applies to applications in which there is no interdependency between the values sharing a key. This contrasts with approaches in which the data of skewed nodes is dynamically repartitioned into small fragments and assigned to idle nodes in the cluster.
Keywords / Index Terms
MapReduce, Skew Mitigation, Shuffling, Partitioning
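
The two-stage aggregation outlined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `two_stage_reduce`, the round-robin spreading of records, and the `combine` parameter are assumptions made purely for illustration. The sketch assumes that `combine` is associative and commutative, which is one way to read the requirement that values sharing a key have no interdependency.

```python
from collections import defaultdict
from functools import reduce
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def two_stage_reduce(
    pairs: Iterable[Tuple[Hashable, int]],
    num_intermediate: int,
    combine: Callable[[int, int], int],
) -> Dict[Hashable, int]:
    """Hypothetical sketch: pre-aggregate on intermediate nodes, then merge.

    Assumes `combine` is associative and commutative, so the values of a
    key can be reduced in any grouping and order.
    """
    # Stage 1: spread records round-robin over the intermediate nodes,
    # ignoring the key, so a heavy key cannot overload a single node.
    buckets: List[Dict[Hashable, List[int]]] = [
        defaultdict(list) for _ in range(num_intermediate)
    ]
    for i, (key, value) in enumerate(pairs):
        buckets[i % num_intermediate][key].append(value)

    # Each intermediate node reduces its local values to one partial
    # result per key, shrinking the data that must be shuffled onward.
    partials: List[Tuple[Hashable, int]] = []
    for bucket in buckets:
        for key, values in bucket.items():
            partials.append((key, reduce(combine, values)))

    # Stage 2: the final worker merges the partial results of each key.
    final: Dict[Hashable, int] = {}
    for key, partial in partials:
        final[key] = combine(final[key], partial) if key in final else partial
    return final
```

For example, with a skewed input of 1000 records for key `"a"` and 10 for key `"b"`, calling `two_stage_reduce(pairs, 4, lambda x, y: x + y)` leaves the final stage with at most four partial sums per key instead of 1000 raw values for `"a"`.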
