@otterc
Contributor

@otterc otterc commented Nov 10, 2020

What changes were proposed in this pull request?

This is the shuffle-writer-side change that lets executors push data to remote shuffle services. It is needed for push-based shuffle (SPIP SPARK-30602).
Summary of changes:

  • Adds support for executors to push shuffle blocks after map tasks finish writing shuffle data.
  • Introduces a timeout specifically for creating connections to remote shuffle services.

Why are the changes needed?

  • These changes are needed for push-based shuffle. Refer to the SPIP in SPARK-30602.
  • A separate connection creation timeout is needed mainly because the existing connectionTimeoutMs is overloaded: it is used both as the connection creation timeout and as the connection idle timeout. The connection creation timeout should be much lower than the idle timeout. The default for connectionTimeoutMs is 120s, which is quite high for just establishing a connection. If a shuffle server node is bad, connection creation will fail within a few seconds. An overloaded shuffle server, however, may take much longer to respond to a request, so the channel can stay idle for much longer, which is expected. Another reason is that with push-based shuffle, an executor may be fetching shuffle data and pushing shuffle data (for the next stage) simultaneously. Both of these tasks share the same connections to the shuffle service. If there is a bad shuffle server node and the connection creation timeout is very high, both tasks end up waiting a long time, which eventually hurts performance.
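
To make the distinction between the two timeouts concrete, here is a minimal Java sketch using only plain JDK sockets. It is not the code in this PR: the class name TimeoutSettings and both property keys are hypothetical, and the fallback of the creation timeout to the idle timeout when unset is an assumption made so that existing behavior would stay unchanged. The socket read timeout is used as a rough stand-in for an idle timeout.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Map;

/** Hypothetical holder for the two timeouts; not Spark's actual configuration class. */
final class TimeoutSettings {
    // Hypothetical property names, chosen for this sketch only.
    static final String IDLE_KEY = "example.io.connectionTimeoutMs";
    static final String CREATE_KEY = "example.io.connectionCreationTimeoutMs";
    // 120s, the default for connectionTimeoutMs mentioned in the description.
    static final int DEFAULT_IDLE_MS = 120_000;

    final int idleTimeoutMs;
    final int creationTimeoutMs;

    private TimeoutSettings(int idleTimeoutMs, int creationTimeoutMs) {
        this.idleTimeoutMs = idleTimeoutMs;
        this.creationTimeoutMs = creationTimeoutMs;
    }

    static TimeoutSettings fromProperties(Map<String, String> props) {
        int idle = Integer.parseInt(
            props.getOrDefault(IDLE_KEY, String.valueOf(DEFAULT_IDLE_MS)));
        // Assumption: when the creation timeout is not set, reuse the idle
        // timeout, so callers that never set it see no change in behavior.
        int creation = Integer.parseInt(
            props.getOrDefault(CREATE_KEY, String.valueOf(idle)));
        return new TimeoutSettings(idle, creation);
    }

    /** Opens a socket, applying each timeout to the phase it is meant for. */
    Socket open(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            // Short timeout: a bad node should fail fast while connecting.
            socket.connect(new InetSocketAddress(host, port), creationTimeoutMs);
            // Long timeout: an overloaded server may respond slowly once connected.
            socket.setSoTimeout(idleTimeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }
}
```

With only the creation timeout set to, say, 5000, a connection attempt to a bad node gives up after about five seconds, while an established but slow channel is still allowed the full 120s.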

Does this PR introduce any user-facing change?

Yes. This PR introduces client-side configs for push-based shuffle. If push-based shuffle is turned off, users will not see any change.

How was this patch tested?

Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in SPARK-30602.
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Min Shen
Co-authored-by: Chandni Singh
Co-authored-by: Ye Zhou

@otterc
Contributor Author

otterc commented Nov 10, 2020

@Victsm @mridulm @tgravescs @jiangxb1987 @attilapiros @Ngone51 Please take a look.

@otterc otterc force-pushed the SPARK-32917 branch 2 times, most recently from fa5a778 to b54589b Compare November 11, 2020 23:10
@dongjoon-hyun dongjoon-hyun changed the title from "[WIP][SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion" to "[WIP][SPARK-32917][SHUFFLE][CORE][test-maven][test-hadoop2.7] Adds support for executors to push shuffle blocks after successful map task completion" Nov 12, 2020
@dongjoon-hyun
Member

ok to test

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35563/

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35563/

SparkQA commented Nov 12, 2020

Test build #130957 has finished for PR 30312 at commit c658423.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member

Ngone51 commented Nov 12, 2020

retest this please

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35611/

SparkQA commented Nov 12, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35611/

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35616/

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35616/

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35619/

SparkQA commented Nov 12, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35619/

SparkQA commented Nov 12, 2020

Test build #131004 has finished for PR 30312 at commit c658423.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 13, 2020

Test build #131010 has finished for PR 30312 at commit 9029993.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 13, 2020

Test build #131013 has finished for PR 30312 at commit 1c43ac0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ShufflePushBlockId(shuffleId: Int, mapIndex: Int, reduceId: Int) extends BlockId

SparkQA commented Dec 19, 2020

Test build #133044 has finished for PR 30312 at commit 6aae02a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37649/

SparkQA commented Dec 19, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37649/

SparkQA commented Dec 19, 2020

Test build #133049 has finished for PR 30312 at commit 21ea881.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@otterc
Contributor Author

otterc commented Jan 4, 2021

Have addressed all the comments so far. The failing test seems unrelated.
cc @Ngone51 @Victsm @mridulm

@mridulm
Contributor

mridulm commented Jan 7, 2021

ok to test

@mridulm
Contributor

mridulm commented Jan 7, 2021

Given no other comments, will merge once tests pass (have retriggered it).

SparkQA commented Jan 7, 2021

Test build #133803 has started for PR 30312 at commit 21ea881.

SparkQA commented Jan 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38392/

SparkQA commented Jan 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38392/

@shaneknapp
Contributor

test this please

SparkQA commented Jan 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38396/

SparkQA commented Jan 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38396/

SparkQA commented Jan 7, 2021

Test build #133807 has finished for PR 30312 at commit 21ea881.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in d00f069 Jan 8, 2021
@mridulm
Contributor

mridulm commented Jan 8, 2021

Merged to master.
Thanks for working on this, @otterc!
And thanks for all the reviews and comments @Ngone51, @dongjoon-hyun, @Victsm

@otterc
Contributor Author

otterc commented Jan 8, 2021

Thanks @mridulm for merging and also reviewing. Thanks @Ngone51, @dongjoon-hyun, and @Victsm for the reviews as well.

@dongjoon-hyun
Member

Thank you, @otterc and @mridulm and all!

8 participants