reduceByKey aggregates, for each key in the parent RDD, all of that key's values into a single value, and then produces a new RDD whose elements are <key, value> pairs. As a result, each key corresponds to exactly one aggregated value.
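To make the semantics concrete, here is a minimal single-machine sketch in plain Python. The helper name `reduce_by_key` is hypothetical; this is not Spark's implementation, only an illustration of what the operator computes.

```python
def reduce_by_key(pairs, func):
    """Fold all values sharing a key into one value with a binary
    function, mimicking RDD.reduceByKey on a single machine."""
    acc = {}
    for k, v in pairs:
        # Merge the new value into the running aggregate for this key.
        acc[k] = func(acc[k], v) if k in acc else v
    return list(acc.items())

# Word-count style usage: every value for the same key is summed.
pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # [('a', 3), ('b', 1)]
```

In real Spark the input pairs are spread across partitions on many nodes, which is exactly the problem the rest of this section addresses.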
Problem: before the aggregation, the values for a given key are not necessarily all in one partition, and are unlikely to all be on the same node, because an RDD is a resilient distributed dataset whose partitions are very likely spread across different nodes.
So how is the aggregation performed?
- Shuffle Write: each map task in the previous stage must ensure that records with the same key from the partition it processes are written to the same partition file; one map task may therefore write multiple different partition files, one per downstream reducer.
- Shuffle Read: each reduce task in the next stage goes to the machines where the previous stage's tasks ran and fetches the partition files that belong to it. This guarantees that all values for a given key end up on the same node, where they can be processed and aggregated.
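The two steps above can be sketched in plain Python. This is a simulation under simplifying assumptions (in-memory lists instead of files, hash partitioning with `hash(key) % num_reducers`, and sum as the aggregation function); the function names are hypothetical, not Spark APIs.

```python
from collections import defaultdict

NUM_REDUCERS = 2

def shuffle_write(map_output):
    """Shuffle write: a map task buckets its (key, value) pairs by
    hash(key) % NUM_REDUCERS, so one key always lands in one bucket."""
    buckets = defaultdict(list)
    for k, v in map_output:
        buckets[hash(k) % NUM_REDUCERS].append((k, v))
    return buckets

def shuffle_read(all_buckets, reducer_id):
    """Shuffle read: a reduce task pulls its own bucket from every map
    task's output, then aggregates the values for each key."""
    acc = defaultdict(int)
    for buckets in all_buckets:
        for k, v in buckets.get(reducer_id, []):
            acc[k] += v
    return dict(acc)

# Two map tasks; key "a" is scattered across both partitions.
map1 = [("a", 1), ("b", 1)]
map2 = [("a", 1), ("c", 1)]
written = [shuffle_write(m) for m in (map1, map2)]
merged = {}
for r in range(NUM_REDUCERS):
    merged.update(shuffle_read(written, r))
print(merged)  # all values for each key were aggregated on one "reducer"
```

Note how the guarantee falls out of the hash: because every map task applies the same `hash(key) % NUM_REDUCERS`, a given key always lands in the same bucket index, so exactly one reducer sees all of its values.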
Spark has had two shuffle manager implementations: HashShuffleManager and SortShuffleManager. Before Spark 1.2 the default was HashShuffleManager; Spark 1.2 introduced SortShuffleManager, and in Spark 2.0+ HashShuffleManager was removed entirely.
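The key difference is the file layout on the map side. Hash-based shuffle writes one file per reducer per map task, while sort-based shuffle sorts records by target partition and writes a single data file plus an index of partition boundaries per map task. The sketch below simulates the sort-based layout with in-memory lists; the function name and structure are illustrative assumptions, not Spark code.

```python
def sort_shuffle_write(map_output, num_reducers):
    """Sort-based shuffle write (simplified): sort records by their
    target partition and emit ONE data 'file' per map task plus an
    index of partition boundaries, instead of one file per reducer."""
    part = lambda kv: hash(kv[0]) % num_reducers
    data_file = sorted(map_output, key=part)
    # index[r] is where partition r starts; index[r + 1] is where it ends.
    index, offset = [], 0
    for r in range(num_reducers):
        index.append(offset)
        offset += sum(1 for kv in data_file if part(kv) == r)
    index.append(offset)
    return data_file, index

data, idx = sort_shuffle_write([("a", 1), ("b", 1), ("a", 1)], 2)
# Reducer r reads only its slice: data[idx[r]:idx[r + 1]]
```

With M map tasks and R reducers, this reduces the number of shuffle files from M × R to M (plus M index files), which is one of the main reasons the sort-based manager replaced the hash-based one.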