[SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle #30480
Conversation
Seems all recent PRs in Spark are failing the build at the javadoc step.
ok to test
Test build #131687 has finished for PR 30480 at commit
Test build #131691 has finished for PR 30480 at commit
Test build #131710 has finished for PR 30480 at commit
Retest this please
Test build #131807 has finished for PR 30480 at commit
@tgravescs @attilapiros @Ngone51 @jiangxb1987 @otterc @mridulm @dongjoon-hyun PR ready for review.
Test build #131792 has finished for PR 30480 at commit
core/src/main/scala/org/apache/spark/scheduler/MergeStatus.scala
Is this necessary? (except the tests)
This is only used in tests; will clarify.
Test build #132071 has finished for PR 30480 at commit
mridulm left a comment
Thanks for the work! Took a pass through it @Victsm
If fetchMergeResult == true, is it right that there is an expectation that (mapOutputStatuses == null) == (mergeResultStatuses == null)?
If yes, can we simplify this?
a) Make this method simpler by using that condition.
b) Do we have any use case for GetMergeResultStatuses without also fetching GetMapOutputStatuses immediately before? If not, combine both to avoid two RPCs when fetchMergeResult == true?
That's not always true.
We currently fetch map status and merge status using 2 separate RPCs.
Although the fetching of these statuses is guarded by the lock, the initial check at line 1113 for these statuses being non-null is outside the lock.
So it is possible for a task to see the map status being non-null while the merge status is still null.
We always need to fetch both map status and merge status together, either during the initial fetch or during fallback.
Combining both RPCs into one would increase the code complexity.
For now, the RPC just returns the pre-serialized bytes for either the MapStatus array or the MergeStatus array.
If we want to combine both into a single RPC, we would need to define additional RPC messages so that we can encode the 2 byte arrays for the serialized MapStatus array and MergeStatus array together.
Combining both does not seem to bring enough benefit: we haven't observed any issue indicating a Spark driver performance regression from doubling the number of RPCs for fetching shuffle statuses, and keeping them separate also keeps the code simpler.
I would expect multiple RPCs not to be the preferred option given the impact on the driver, but code simplicity needs to be balanced against that.
+CC @JoshRosen, @Ngone51 who last made changes here. Any thoughts on this?
I'd prefer to combine them. Actually, the first time I reviewed this PR, I started thinking about a unified way to provide a consistent API for both map status and merged status in MapOutputTracker & ShuffleStatus. Unfortunately, I didn't come up with a good idea.
I think one RPC would ease the error handling for us. Not sure how much complexity you'd expect?
And I'd suggest adding an additional new RPC for the combined case and leaving the current one as it is, so that we don't affect the existing code path when push-based shuffle is disabled.
One option could be to replace GetMergeResultStatuses with GetMapOutputAndMergeResultStatuses.
That keeps non-push-based shuffle codepaths unchanged, and when push-based shuffle is enabled, a single RPC handles the response: the code change would mirror what has been done for GetMergeResultStatuses already.
Thoughts @Ngone51, @Victsm, @venkata91?
With the current RPC (`RpcCallContext`) mechanism in `MapOutputTracker`, we can only send one response, as opposed to other RPC mechanisms within Spark. If we have to combine getting both MapStatuses and MergeStatuses when push-based shuffle is enabled, then we have a couple of options:
- Encode both MapStatuses and MergeStatuses in the same `Array[Byte]` returned from `serializedOutputStatus`, with some encoding scheme like the length in bytes of the mapStatuses as the first part, then the mapStatuses themselves, and similarly for the mergeStatuses; in `deserializeOutputStatuses` we would then decode the output of the `GetMapOutputAndMergeResultStatuses` RPC call accordingly. This is a somewhat unclean approach, as the client rather than the RPC layer owns the semantics of encoding/decoding the byte array. Although something similar is already being done wrt whether the mapStatuses are a `DIRECT` fetch or a `BROADCAST` fetch.
- If not, we might need to make changes to `RpcCallContext` in order to respond with 2 byte arrays. This seems to be a lot of additional overhead just for this purpose.
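The length-prefixed scheme described in the first option could be sketched roughly as follows. This is a minimal illustration only; `StatusCodec`, `encodeStatuses`, and `decodeStatuses` are hypothetical names, not Spark APIs:

```scala
import java.nio.ByteBuffer

// Hypothetical sketch of packing two pre-serialized status arrays into one payload:
// [len(map)][map bytes][len(merge)][merge bytes]
object StatusCodec {
  def encodeStatuses(mapBytes: Array[Byte], mergeBytes: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 + mapBytes.length + mergeBytes.length)
    buf.putInt(mapBytes.length).put(mapBytes)
    buf.putInt(mergeBytes.length).put(mergeBytes)
    buf.array()
  }

  def decodeStatuses(payload: Array[Byte]): (Array[Byte], Array[Byte]) = {
    val buf = ByteBuffer.wrap(payload)
    val mapBytes = new Array[Byte](buf.getInt())
    buf.get(mapBytes)
    val mergeBytes = new Array[Byte](buf.getInt())
    buf.get(mergeBytes)
    (mapBytes, mergeBytes)
  }
}
```

As the comment notes, the drawback is that this encode/decode contract lives in the client code rather than in the RPC layer.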
That is an implementation detail of what the response of GetMapOutputAndMergeResultStatuses is, right?
It can simply be an encoding of Array[Array[Byte]] (for example), where result(0) is for MapStatus and result(1) is for MergeStatus, keeping everything else the same?
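A rough sketch of this shape, where the combined message gets a single two-part reply while the existing map-status-only path stays untouched. The message and handler names here are hypothetical simplifications, not the actual MapOutputTracker RPC plumbing:

```scala
// Illustrative sketch only: a new combined message alongside the existing one,
// with the tracker replying with both pre-serialized arrays in one response.
object RpcSketch {
  sealed trait TrackerMessage
  case class GetMapOutputStatuses(shuffleId: Int) extends TrackerMessage
  case class GetMapAndMergeResultStatuses(shuffleId: Int) extends TrackerMessage

  def handle(msg: TrackerMessage,
             serializedMapStatuses: Int => Array[Byte],
             serializedMergeStatuses: Int => Array[Byte]): Any = msg match {
    case GetMapOutputStatuses(id) =>
      serializedMapStatuses(id) // existing code path, unchanged
    case GetMapAndMergeResultStatuses(id) =>
      // One RPC carries both serialized arrays when push-based shuffle is enabled.
      (serializedMapStatuses(id), serializedMergeStatuses(id))
  }
}
```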
I am fine with Tuple as well.
+CC @Ngone51 in case you have any other thoughts.
Just to provide an update here on this PR.
1. Handling of MergeResults from the executors in MapOutputTracker
2. Shuffle merge finalization in DAGScheduler
This also includes the following changes:
- LIHADOOP-52972 Tests for changes in MapOutputTracker and DAGScheduler related to push-based shuffle.
Author: Chandni Singh <[email protected]>
- LIHADOOP-52202 Utility to create a directory with 770 permission.
Author: Chandni Singh <[email protected]>
- LIHADOOP-52972 Moved isPushBasedShuffleEnabled to Utils and added a unit test for it.
Author: Ye Zhou <[email protected]>
…scheduler encounters a shuffle chunk failure RB=2151376 BUG=LIHADOOP-54115 G=spark-reviewers R=yezhou,mshen A=mshen
… a shuffle chunk fails
ok to test
Looks good to me, thanks for the changes @venkata91
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137398 has finished for PR 30480 at commit
@mridulm It seems like it is still failing; not sure why these tests are failing. I ran the failing tests on my laptop and they worked fine. Checked the
```scala
if (fetchedMapStatuses == null || fetchedMergeStatuses == null) {
  logInfo("Doing the fetch; tracker endpoint = " + trackerEndpoint)
  val fetchedBytes =
    askTracker[(Array[Byte], Array[Byte])](GetMapAndMergeResultStatuses(shuffleId))
```
I may have missed some discussion since my last review, but I think this breaches the decision we made before:
we won't affect the existing code path in the map-status-only case.
I think you can return the map status only at the sender side to keep the same behavior?
Do you mean separating out the handling of GetMapStatusMessage and GetMapAndMergeStatusMessage, to avoid returning (mapStatuses, null) in the case of GetMapStatusMessage and keep it the same way as before, just returning mapStatuses?
Yes. (cc @mridulm)
@Ngone51 I updated the PR assuming that is what you meant with your above comment. Let me know if that's not the case.
```scala
if (shuffleStatus != null) {
  // Check if the map output is pre-merged and if the merge ratio is above the threshold.
  // If so, the location of the merged block is the preferred location.
  val preferredLoc = if (pushBasedShuffleEnabled) {
```
Doesn't this path need to respect shuffleLocalityEnabled too?
I agree that we should make it consistent, but there's also a clear difference between locality calculation for push-based shuffle and the original shuffle.
My understanding is that this flag was added because of the potentially costly computation for shuffle locality in the original shuffle.
For push-based shuffle, that cost is no longer a concern, and the reducer task can achieve much better locality.
Always calculating shuffle locality is preferred.
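The merge-ratio check being discussed might look roughly like this. It is an illustrative sketch only; `MergeInfo`, the threshold parameter, and the data shapes are hypothetical, not the actual Spark implementation:

```scala
// Hypothetical sketch: prefer the merger's host only when most of the partition's
// data was merged there; otherwise fall back to regular map-output locality.
object MergeLocality {
  // Per-partition merge metadata: bytes merged at a location vs. total partition size.
  case class MergeInfo(mergedBlockSize: Long, totalBlockSize: Long, location: String)

  def preferredMergedLocation(info: MergeInfo, ratioThreshold: Double): Option[String] = {
    val mergeRatio = info.mergedBlockSize.toDouble / info.totalBlockSize
    if (mergeRatio >= ratioThreshold) Some(info.location) else None
  }
}
```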
593f092 to e62e953
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137544 has finished for PR 30480 at commit
Test build #137556 has finished for PR 30480 at commit
Test build #137557 has finished for PR 30480 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137657 has finished for PR 30480 at commit
mridulm left a comment
Changes look good to me.
+CC @Ngone51, @attilapiros, @tgravescs for another pass before merging...
ok to test
Merged to master, thanks for working on this @venkata91 and @Victsm!
Kubernetes integration test starting
Kubernetes integration test status failure
…sed shuffle

What changes were proposed in this pull request?
This is one of the patches for SPIP SPARK-30602 for push-based shuffle. Summary of changes:
- Introduce `MergeStatus` which tracks the partition level metadata for a merged shuffle partition in the Spark driver
- Unify `MergeStatus` and `MapStatus` under a single trait to allow code reuse inside `MapOutputTracker`
- Extend `MapOutputTracker` to support registering / unregistering `MergeStatus`, calculating preferred locations for a shuffle taking merged shuffle partitions into consideration, and serving reducer requests for block fetching locations with merged shuffle partitions.

The added APIs in `MapOutputTracker` will be used by `DAGScheduler` in SPARK-32920 and by `ShuffleBlockFetcherIterator` in SPARK-32922

Why are the changes needed?
Refer to SPARK-30602

Does this PR introduce _any_ user-facing change?
No

How was this patch tested?
Added unit tests.

Lead-authored-by: Min Shen mshen@linkedin.com
Co-authored-by: Chandni Singh chsingh@linkedin.com
Co-authored-by: Venkata Sowrirajan vsowrirajan@linkedin.com

Closes apache#30480 from Victsm/SPARK-32921.

Lead-authored-by: Venkata krishnan Sowrirajan <[email protected]>
Co-authored-by: Min Shen <[email protected]>
Co-authored-by: Chandni Singh <[email protected]>
Co-authored-by: Chandni Singh <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>