-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-3398] [SPARK-4325] [EC2] Use EC2 status checks. #3195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #23175 has started for PR 3195 at commit
|
|
Test build #23175 has finished for PR 3195 at commit
|
|
Test PASSed. |
|
Pinging @JoshRosen and @shivaram again. This PR re-introduces a commit that somehow disappeared from #2339, and I think it should be good for inclusion in 1.2.0, since that's what it was originally slated for. |
ec2/spark_ec2.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is capturing the output not required anymore ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it never was.
When I originally added this in, I was testing in the Python shell and got confused. i.update() echoes its return value to the shell, so I thought I needed to suppress that output. But as part of a program, nothing gets echoed. We don't need the return value, so there's no need to capture it.
It's like just typing var in the shell and seeing its value. As part of a program, var alone will do nothing.
|
@nchammas I am bit confused about this change. The EC2 instance status checks seemed to be working fine when I tried things out last week, so I am not sure what we are missing. Is it just removing the exponential back-off ? |
|
Remember we were concerned about forking too many processes to test SSH. Without this patch, Since this patch sharply reduces the number of SSH checks we need to make, there is no need for a back-off. The back-off was originally introduced as a measure to reduce the number of SSH checks. |
|
Test build #23556 has started for PR 3195 at commit
|
|
Test build #23556 has finished for PR 3195 at commit
|
|
Test PASSed. |
|
@shivaram I think this PR is good to go. Did my response to your questions make sense? |
|
Yeah this makes sense, thanks for clarifying. One thing I am still thinking about is if it makes sense to do constant poll vs. linear back-off. From what I can see we make the EC2 status checks now every 5 seconds. I think that should be fine, though in the past I have run into cases where EC2 requests get rate-limited and start failing. @JoshRosen any other thoughts ? |
|
This seems okay to me, I guess. If we're being really conservative, I guess we could just leave the linear back-off in place. |
|
Yeah, I removed it specifically to be more aggressive and shave off some
|
|
I guess I'd prefer to leave it in, but that's only because I don't want to risk a regression and don't really have time to figure out whether removing it is safe. |
This reverts commit adb4eaa and increases the wait interval to 5 seconds.
|
OK @JoshRosen, I've added the backoff back in. I didn't notice any regressions without the linear backoff, so I'll make a note to remove it in a future version and document the time savings. Fair enough? |
|
Test build #23945 has started for PR 3195 at commit
|
|
Test build #23945 has finished for PR 3195 at commit
|
|
Test PASSed. |
|
LGTM, so I'm going to merge this into |
|
Hmm, looks like I didn't resolve the JIRA issue here. Should this also be targeted for any backports, or is it fine to just have this in 1.3.0? |
|
@JoshRosen This work was originally supposed to be in for 1.2.0, but since it's not a critical piece of work I don't think there is a need to backport it for 1.2.1 or whatnot. All it does is reduce the number of SSH calls made during launch by using the EC2 status checks, so I think it's OK to leave it for 1.3.0. |
|
Alright, in that case I'm going to mark SPARK-4325 as 'Resolved'. Thanks! |
|
FYI: The commit message at 317e114 now incorrectly states that the linear backoff was removed. Is there any way to fix that? |
|
Doh! Unfortunately, there's no way for us to go back and edit commit messages without messing up the git history. In the future, I'll be more careful when verifying commit message before merging (I should make a review checklist for these sorts of steps...) |
This PR re-introduces 0e648bc from PR #2339, which somehow never made it into the codebase.
Additionally, it removes a now-unnecessary linear backoff on the SSH checks since we are blocking on EC2 status checks before testing SSH.