Does hadoop streaming use a stable sort between map and reduce phases?

https://stackoverflow.com/questions/8579989

22-03-2021

Question

This has ramifications for multi-stage jobs. For example, if we sort by key "a" in phase 1 of the job and by key "b" in phase 2 of the job (which takes phase 1 output as stdin), can we assume when the two phases are complete that the records are sorted by key "b" and secondarily by key "a"? For the purpose of this question, assume that the mappers and reducers do not permute record order. Also assume the number of reduce tasks is 1 or more.

Bear in mind the answer may vary depending on the number of reduce tasks for phase 1. For example, if the number of reduce tasks for phase 1 were greater than 1, the values of key "a" would be split across multiple output files (though in sorted order within each file). However, when there is only one reduce task, all values will appear in the same file, and this may be a necessary condition for stability, depending on the implementation.

If the answer is affirmative, a link to appropriate documentation will be most helpful.

Thanks,

SetJmp

Solution

By default, Hadoop will not enforce the stable sorting properties you desire.
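To illustrate why the secondary order cannot be assumed, here is a small Python simulation (not Hadoop code). It is a sketch under generous assumptions: every sort and merge in it is stable, which Hadoop does not promise, and the parity split is only a stand-in for hash partitioning across two phase 1 reduce tasks. Even then, the order on "a" is lost once phase 2 merges runs that come from different part files.

```python
import heapq
from operator import itemgetter

# Hypothetical records as (a, b) pairs. Every record has b == 0,
# so all of them tie on key "b".
records = [(1, 0), (2, 0), (3, 0), (4, 0)]

# Phase 1 with two reduce tasks: keys are split by parity here, as a
# stand-in for hash partitioning. Each part file is sorted by "a".
part_0 = sorted((r for r in records if r[0] % 2 == 1), key=itemgetter(0))
part_1 = sorted((r for r in records if r[0] % 2 == 0), key=itemgetter(0))

# Phase 2: assume one map task per part file. Each run is sorted by "b"
# with a stable sort, then the runs are merged on "b" with a stable merge.
run_0 = sorted(part_0, key=itemgetter(1))
run_1 = sorted(part_1, key=itemgetter(1))
merged = list(heapq.merge(run_0, run_1, key=itemgetter(1)))


def is_sorted_by_b_then_a(rows):
    """Return True if rows are ordered by b, with ties ordered by a."""
    keys = [(b, a) for a, b in rows]
    return keys == sorted(keys)


print(merged)                         # records that tie on "b", in merge order
print(is_sorted_by_b_then_a(merged))  # False: "a" is interleaved across runs
```

With a single phase 1 reduce task and a single phase 2 map task the simulation would preserve the order on "a", but only because the simulated sorts are stable. That is not something to rely on in the real shuffle.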

Hadoop streaming provides a Comparator and a Partitioner to control how map output is sorted and partitioned on its way to the reducers; take a look here
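One way to avoid depending on stability at all is to make "b" and "a" a composite key in a single job, so the framework's own sort produces the order you want. Below is a minimal sketch of the mapper side. It assumes tab-separated input with "a" in the first field and "b" in the second; the function name and field positions are hypothetical.

```python
import sys


def composite_key_lines(lines, a_index=0, b_index=1, sep="\t"):
    """Yield 'b<sep>a<sep>original line' so b and a lead each record."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        yield sep.join((fields[b_index], fields[a_index], line))


# Small in-memory demonstration with made-up records.
sample = ["a2\tb1\tpayload-x\n", "a1\tb1\tpayload-y\n"]
for out in composite_key_lines(sample):
    print(out)

# In an actual streaming mapper you would feed it standard input instead:
#     for out in composite_key_lines(sys.stdin):
#         print(out)
```

For the framework to sort on both leading fields, the job also has to be told that the key spans two fields, for example with stream.num.map.output.key.fields=2, and optionally with org.apache.hadoop.mapred.lib.KeyFieldBasedComparator and org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner for finer control. Check the exact property names and option syntax against the streaming documentation for your Hadoop version, since they have changed between releases. Keys are compared as text by default, so numeric fields need zero-padding or a numeric comparator option.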

Edit: updated broken link

Licensed under: CC-BY-SA with attribution
