Skip to content

Conversation

@yaooqinn
Copy link
Member

What changes were proposed in this pull request?

We have spark.shuffle.io.connectionTimeout for timeout checking when shuffle clients establish connections with shuffle servers. We also use it to check the established connection is idled or dead too. If we want a connection that can keep alive for a long time, we may increase the blocking time to create a shuffle client if the server is busy at that time.

In practice, we always use connection timeout and session timeout to control these two different phases.
This PR adds spark.shuffle.io.sessionTimeout to check an established connection for being idled or dead

Why are the changes needed?

Does this PR introduce any user-facing change?

yes, w/ new conf spark.shuffle.io.sessionTimeout ` to check an established connection for being idled or dead

How was this patch tested?

modified tests

…heck a established connnection for being idled or dead
@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #133029 has finished for PR 30844 at commit 9f9f2f6.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37628/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37628/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37632/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37632/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #133033 has finished for PR 30844 at commit e51904f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Copy link
Contributor

mridulm commented Dec 19, 2020

+CC @otterc

@yaooqinn
Copy link
Member Author

cc @cloud-fan @maropu @HyukjinKwon too thanks

@otterc
Copy link
Contributor

otterc commented Dec 19, 2020

@yaooqinn
In #30312, we have introduced a connectionCreationTimeout and I think the motivation for that change and this one is the same. So, we should decide which implementation, related with this config, we would want go ahead with.
cc @mridulm @Ngone51 @Victsm @tgravescs

@yaooqinn
Copy link
Member Author

Oh, sorry @otterc, I didn't notice #30312.
IMHO, connectionCreationTimeout seems to be not a common usage. But sessionTimeout and connectionTimeout are always used as a pair in RPC environments.
#30312 seems will get merged soon. I am fine w/ that too.

@otterc
Copy link
Contributor

otterc commented Dec 21, 2020

IMHO, connectionCreationTimeout seems to be not a common usage. But sessionTimeout and connectionTimeout are always used as a pair in RPC environments.

Just stating the reason for why we went with creating connectionCreationTimeout instead of idle/sessionTimeout.
We wanted to reduce the timeout for creating connections. Ideally this should be much lower and for our clusters we have set this to 30s. However, the default value for connectionTimeout is high which makes sense because it is also used as idleTimeout and since shuffle services can be overloaded, they may take longer to respond. The problem with creating a separate config for idle/sessionTimeout and not for connectionCreation is that some users (frameworks) may have also over-written the value of connectionTimeout to a much higher value. So, even if want to enforce all these jobs to use smaller timeouts for creating connections, they will still be using a larger timeouts for that.

@yaooqinn yaooqinn closed this Jan 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants