-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle services for merging partitions #30164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…huffle by selecting external shuffle services for merging partitions
|
cc @Victsm @mridulm @otterc @zhouyejoe @tgravescs @jiangxb1987 @attilapiros @Ngone51 |
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
Outdated
Show resolved
Hide resolved
|
ok to test |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
the paper talks about: I know this was also talked about in the SPIP. The current implementation seems to not do this. can the description here please be updated to state what it does and why. Is this something to be PR'd later or just talked about as a possibility |
|
@tgravescs
This PR enables the first.
In a follow up patch for driver side change (MapOutputTracker#getPreferredLocationsForShuffle), the second is enabled.
|
|
Test build #130376 has finished for PR 30164 at commit
|
|
thanks for the explanation |
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
Outdated
Show resolved
Hide resolved
| * at the same time. Will improve this in a later version. | ||
| */ | ||
| def isPushBasedShuffleEnabled(conf: SparkConf): Boolean = { | ||
| conf.get(PUSH_BASED_SHUFFLE_ENABLED) && conf.get(SHUFFLE_SERVICE_ENABLED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I look forward to the day when the second condition will be disabled :-)
It will be relevant for both k8s and spark streaming !
+CC @dongjoon-hyun you might be interested in this in future.
core/src/main/scala/org/apache/spark/internal/config/package.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/internal/config/package.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/internal/config/package.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/internal/config/package.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala
Outdated
Show resolved
Hide resolved
|
Test build #130533 has finished for PR 30164 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #130575 has finished for PR 30164 at commit
|
core/src/main/scala/org/apache/spark/internal/config/package.scala
Outdated
Show resolved
Hide resolved
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
Outdated
Show resolved
Hide resolved
|
Test build #130628 has finished for PR 30164 at commit
|
|
Kubernetes integration test status success |
|
Kubernetes integration test status success |
|
Test build #131302 has finished for PR 30164 at commit
|
|
Test build #131298 has finished for PR 30164 at commit
|
|
Test build #131303 has finished for PR 30164 at commit
|
|
Test build #131297 has finished for PR 30164 at commit
|
|
The scala-2.13 failure is unrelated to this PR, and is getting fixed. |
No thats it, only addressing @dongjoon-hyun comments and merging the upstream master as it had some merge conflicts. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #131367 has finished for PR 30164 at commit
|
|
Can you update the PR @venkata91 ? The failures are unrelated, but want to make sure - the upstream changes should help fix the issues. Thx ! |
Done. Thanks. |
|
ok to test |
|
Kubernetes integration test starting |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test status failure |
|
+CC @dongjoon-hyun Looks like the test failures in jenkins are from executor decomissioning. The github actions went through fine. |
|
Test build #131378 has finished for PR 30164 at commit
|
|
I agree with you. @mridulm . Please proceed to merge. |
|
Due to the K8s IT flakiness of |
|
Merging to master, thanks for working on this @venkata91 ! |
|
Thank you all! FYI, K8s IT is happy on |
Thanks Mridul for shepherding this all along :) |
…s for coordinating push based shuffle by selecting external shuffle services for merging partitions Driver side changes for coordinating push based shuffle by selecting external shuffle services for merging partitions. This PR includes changes related to `ShuffleMapStage` preparation which is selection of merger locations and initializing them as part of `ShuffleDependency`. Currently this code is not used as some of the changes would come subsequently as part of https://issues.apache.org/jira/browse/SPARK-32917 (shuffle blocks push as part of `ShuffleMapTask`), https://issues.apache.org/jira/browse/SPARK-32918 (support for finalize API) and https://issues.apache.org/jira/browse/SPARK-32920 (finalization of push/merge phase). This is why the tests here are also partial, once these above mentioned changes are raised as PR we will have enough tests for DAGScheduler piece of code as well. Added a new API in `SchedulerBackend` to get merger locations for push based shuffle. This is currently implemented for Yarn and other cluster managers can have separate implementations which is why a new API is introduced. Yes, user facing config to enable push based shuffle is introduced Added unit tests partially and some of the changes in DAGScheduler depends on future changes, DAGScheduler tests will be added along with those changes. Lead-authored-by: Venkata krishnan Sowrirajan vsowrirajanlinkedin.com Co-authored-by: Min Shen mshenlinkedin.com Closes #30164 from venkata91/upstream-SPARK-32919. Lead-authored-by: Venkata krishnan Sowrirajan <[email protected]> Co-authored-by: Min Shen <[email protected]> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
What changes were proposed in this pull request?
Driver side changes for coordinating push based shuffle by selecting external shuffle services for merging partitions.
This PR includes changes related to
ShuffleMapStagepreparation which is selection of merger locations and initializing them as part ofShuffleDependency.Currently this code is not used as some of the changes would come subsequently as part of https://issues.apache.org/jira/browse/SPARK-32917 (shuffle blocks push as part of
ShuffleMapTask), https://issues.apache.org/jira/browse/SPARK-32918 (support for finalize API) and https://issues.apache.org/jira/browse/SPARK-32920 (finalization of push/merge phase). This is why the tests here are also partial, once these above mentioned changes are raised as PR we will have enough tests for DAGScheduler piece of code as well.Why are the changes needed?
Added a new API in
SchedulerBackendto get merger locations for push based shuffle. This is currently implemented for Yarn and other cluster managers can have separate implementations which is why a new API is introduced.Does this PR introduce any user-facing change?
Yes, user facing config to enable push based shuffle is introduced
How was this patch tested?
Added unit tests partially and some of the changes in DAGScheduler depends on future changes, DAGScheduler tests will be added along with those changes.
Lead-authored-by: Venkata krishnan Sowrirajan [email protected]
Co-authored-by: Min Shen [email protected]