Skip to content

Conversation

@nchammas
Copy link
Contributor

This PR re-introduces 0e648bc from PR #2339, which somehow never made it into the codebase.

Additionally, it removes a now-unnecessary linear backoff on the SSH checks since we are blocking on EC2 status checks before testing SSH.

@SparkQA
Copy link

SparkQA commented Nov 11, 2014

Test build #23175 has started for PR 3195 at commit adb4eaa.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 11, 2014

Test build #23175 has finished for PR 3195 at commit adb4eaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23175/
Test PASSed.

@nchammas
Copy link
Contributor Author

cc @JoshRosen @shivaram

@nchammas
Copy link
Contributor Author

Pinging @JoshRosen and @shivaram again.

This PR re-introduces a commit that somehow disappeared from #2339, and I think it should be good for inclusion in 1.2.0, since that's what it was originally slated for.

ec2/spark_ec2.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is capturing the output not required anymore ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it never was.

When I originally added this in, I was testing in the Python shell and got confused. i.update() echoes its return value to the shell, so I thought I needed to suppress that output. But as part of a program, nothing gets echoed. We don't need the return value, so there's no need to capture it.

It's like just typing var in the shell and seeing its value. As part of a program, var alone will do nothing.

@shivaram
Copy link
Contributor

@nchammas I am bit confused about this change. The EC2 instance status checks seemed to be working fine when I tried things out last week, so I am not sure what we are missing. Is it just removing the exponential back-off ?

@nchammas
Copy link
Contributor Author

spark-ec2 currently blocks on SSH availability correctly, but it tests SSH more often than is necessary. As you suggested here, this PR reintroduces a change that makes spark-ec2 check the EC2 status checks before trying SSH. This change was already in #2339, but someone got dropped. I'm not sure how.

Remember we were concerned about forking too many processes to test SSH. Without this patch, spark-ec2 tests SSH the moment the instances are marked as online. With this patch, spark-ec2 waits until the EC2 status checks additionally clear before testing SSH.

Since this patch sharply reduces the number of SSH checks we need to make, there is no need for a back-off. The back-off was originally introduced as a measure to reduce the number of SSH checks.

@SparkQA
Copy link

SparkQA commented Nov 18, 2014

Test build #23556 has started for PR 3195 at commit ef3ca99.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 18, 2014

Test build #23556 has finished for PR 3195 at commit ef3ca99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23556/
Test PASSed.

@nchammas
Copy link
Contributor Author

@shivaram I think this PR is good to go. Did my response to your questions make sense?

@shivaram
Copy link
Contributor

Yeah this makes sense, thanks for clarifying.

One thing I am still thinking about is if it makes sense to do constant poll vs. linear back-off. From what I can see we make the EC2 status checks now every 5 seconds. I think that should be fine, though in the past I have run into cases where EC2 requests get rate-limited and start failing.

@JoshRosen any other thoughts ?

@JoshRosen
Copy link
Contributor

This seems okay to me, I guess. If we're being really conservative, I guess we could just leave the linear back-off in place.

@nchammas
Copy link
Contributor Author

Yeah, I removed it specifically to be more aggressive and shave off some
seconds from the launch time. Do you think that's OK? Or would you prefer
the back off stay in?
On 2014년 11월 26일 (수) at 오후 2:59 Josh Rosen [email protected] wrote:

This seems okay to me, I guess. If we're being really conservative, I
guess we could just leave the linear back-off in place.


Reply to this email directly or view it on GitHub
#3195 (comment).

@JoshRosen
Copy link
Contributor

I guess I'd prefer to leave it in, but that's only because I don't want to risk a regression and don't really have time to figure out whether removing it is safe.

This reverts commit adb4eaa and
increases the wait interval to 5 seconds.
@nchammas
Copy link
Contributor Author

OK @JoshRosen, I've added the backoff back in.

I didn't notice any regressions without the linear backoff, so I'll make a note to remove it in a future version and document the time savings. Fair enough?

@SparkQA
Copy link

SparkQA commented Nov 29, 2014

Test build #23945 has started for PR 3195 at commit efb29e1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 29, 2014

Test build #23945 has finished for PR 3195 at commit efb29e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23945/
Test PASSed.

@JoshRosen
Copy link
Contributor

LGTM, so I'm going to merge this into master. Thanks!

@asfgit asfgit closed this in 317e114 Nov 29, 2014
@nchammas nchammas deleted the remove-ec2-ssh-backoff branch November 29, 2014 17:58
@JoshRosen
Copy link
Contributor

Hmm, looks like I didn't resolve the JIRA issue here. Should this also be targeted for any backports, or is it fine to just have this in 1.3.0?

@nchammas
Copy link
Contributor Author

@JoshRosen This work was originally supposed to be in for 1.2.0, but since it's not a critical piece of work I don't think there is a need to backport it for 1.2.1 or whatnot.

All it does is reduce the number of SSH calls made during launch by using the EC2 status checks, so I think it's OK to leave it for 1.3.0.

@JoshRosen
Copy link
Contributor

Alright, in that case I'm going to mark SPARK-4325 as 'Resolved'. Thanks!

@nchammas
Copy link
Contributor Author

FYI: The commit message at 317e114 now incorrectly states that the linear backoff was removed. Is there any way to fix that?

@JoshRosen
Copy link
Contributor

Doh! Unfortunately, there's no way for us to go back and edit commit messages without messing up the git history. In the future, I'll be more careful when verifying commit message before merging (I should make a review checklist for these sorts of steps...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants