Spark shuffle concept

manba_ yqq 2022-01-15 02:23:56

reduceByKey Will be the last RDD Each of them key All corresponding value polymerization Become a value, And then you generate a new one RDD, The type of element is <key,value> Right form , So each of these key Corresponding to an aggregated value.

problem : Before aggregation , every last key Corresponding value It doesn't have to be all in one partition in , It's not likely to be on the same node , because RDD It's distributed elasticity Data set of ,RDD Of partition It's very likely to be distributed across nodes .

How to aggregate ?

  • Shuffle Write: the previous stage Each map task You must ensure that you The data of the current partition processed is the same key Write to a partition file , Multiple may be written In a different partition file .
  • – Shuffle Read:reduce task From the last one stage All of the task Where Find the partition files that belong to you on your machine , So that every one of them can be guaranteed key Yes Ought to be value They all converge to the same node for processing and aggregation .

Spark There are two kinds of Shuffle Management type ,HashShufflManager and SortShuffleManager,Spark1.2 It used to be HashShuffleManager, Spark1.2 introduce SortShuffleManager, stay Spark 2.0+ The version already includes HashShuffleManager discarded .

Similar articles