[EC2] [SPARK-6188] Instance types can be mislabeled when re-starting cluster with default arguments #4916
Conversation
Can one of the admins verify this patch?
ec2/spark_ec2.py (Outdated)
Does this not force the instance types to match the existing ones? That's not inherently required, right? I might be misunderstanding.
You are correct. Perhaps giving the user the option to change the instance type is a good middle ground. But misrepresenting the instance type by falling back to the default value can, as we saw, cause problems without the user's knowledge.
It makes some sense for the default to be 'same instance type', yes, but without forcing it to be the same one. It could emit an info message if they differ.
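For illustration, a minimal sketch of that behavior might look like the following (the function and variable names are placeholders, not the script's actual ones):

```python
# Minimal illustrative sketch (placeholder names, not spark_ec2.py's actual ones):
# default to the existing type, allow an explicit override, and log a message
# when the requested and existing types differ.
def choose_instance_type(requested_type, existing_type, explicitly_set):
    """Pick the instance type to use when restarting a stopped cluster."""
    if not explicitly_set:
        return existing_type  # default: keep whatever is already running
    if requested_type != existing_type:
        print("INFO: requested instance type %s differs from the existing "
              "instances' type %s" % (requested_type, existing_type))
    return requested_type
```

That keeps the default safe while leaving the final say with the user.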
OK, so the use case would be:
1. The user starts a cluster with some instance type, and later stops it.
2. The user now wants to restart the cluster.
3. We can either:
   i. just restart the same instances, or
   ii. allow the user to change the instance types by redefining them in the options passed with the start command.

Is 3.ii a viable use case? Is it possible to change the type of the instance being re-launched, for example in the case where the cluster also had persistent storage, i.e. extra EBS volumes that are still attached to the stopped instances? (See the sketch below.)
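For reference, EC2 itself does allow changing the type of a stopped, EBS-backed instance, and attached EBS volumes stay attached across the change (instance-store disks do not carry over). A rough sketch with boto 2.x, using a placeholder region, instance id, and target type:

```python
# Rough sketch (boto 2.x): change the type of a *stopped*, EBS-backed instance.
# Region, instance id, and target type below are placeholders.
import time
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
instance_id = "i-12345678"
instance = conn.get_all_instances(instance_ids=[instance_id])[0].instances[0]

conn.stop_instances(instance_ids=[instance_id])
while instance.update() != "stopped":
    time.sleep(10)  # wait until EC2 reports the instance as stopped

# Attached EBS volumes remain attached; only the instance type changes.
conn.modify_instance_attribute(instance_id, "instanceType", "m3.xlarge")
conn.start_instances(instance_ids=[instance_id])
```

Whether spark_ec2.py should expose that is a separate question from this PR.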
Paging @nchammas and/or @shivaram for review, as I am really not the best person to ask, but I had expected that you could change the instance type on restart. It may not work for all possible combinations, but it seems like something this script doesn't have to police -- though it also should not prohibit it.
ok to test
Test build #28339 has started for PR 4916 at commit
Test build #28339 has finished for PR 4916 at commit
Test FAILed.
Removed trailing whitespace, was that the style problem?
You can click through to the result of the Jenkins build to see the error. You can also run the checks locally to double-check your work.
Test build #28341 has started for PR 4916 at commit
Test build #28341 has finished for PR 4916 at commit
Test FAILed.
Test build #28342 has started for PR 4916 at commit
@thvasilo can you clarify what flow you are seeing this behavior for? I can't reproduce the issue you're reporting. Here's what I just tried on
My cluster came back up with the instance types I originally specified, which is the expected behavior.
I also tried to change the instance type when starting a stopped cluster. That didn't work, and the instances were brought up with their original instance types. I think that is potentially something we could improve.
Test build #28342 has finished for PR 4916 at commit
Test PASSed.
@nchammas The instances do come up with the correct type; the problem is more the one described here. In short, while the instances themselves are OK, their setup is done assuming the default instance type, m1.large. This means that the machines are assumed to have 2 disks, which leads to the problems described in issue 5838, where machines that have one disk end up with shuffle spills in the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors. I will update issue 6188 to include this information.
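To make that concrete, the setup derives the number of local scratch disks from the instance type, along the lines of this simplified sketch (the table and values here are illustrative, not a copy of the script's):

```python
# Simplified, illustrative lookup of local/ephemeral disks per instance type.
# (Example values only, not a copy of spark_ec2.py's table.)
DISKS_BY_INSTANCE = {
    "m1.large": 2,  # the default assumed by the script: two ephemeral disks
    "m3.large": 1,
    "r3.large": 1,
}

def get_num_disks(instance_type):
    # Unknown types fall back to a single disk rather than failing outright.
    return DISKS_BY_INSTANCE.get(instance_type, 1)

# If the cluster was launched with a one-disk type but restarted with the
# m1.large default in the options, setup expects two scratch disks and shuffle
# data ends up spilling to the small (8GB) snapshot partition instead.
```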
Ah, I see. Thank you for the clarification. I would add this to the PR body, since it becomes part of the commit message if this PR gets merged.
I understand the goal to be:
This change seems to always force the new instance type to match the old instance type, though. Is the argument that this should not be allowed any longer? Or am I mistaken and it was never allowed? In any event, whatever the instance type is, it should of course be configured correctly for that instance type!
I am also not sure about this. It would be cool to allow the user to change the instance type if that is possible; otherwise, just relaunching the same cluster with the same instance types seems like a good compromise.
I guess I'm trying to figure out whether this change subtracts existing functionality. That would be potentially controversial and is what I would hesitate over. If it merely doesn't add this functionality, that seems fine to me. I'm not necessarily arguing that changing the instance type should be possible, either.
Dumb question here too: does this assume there exists a running master node? It references the first running master node, but what if there aren't any? Or is that handled?
Yup, this is handled in get_existing_cluster, which is called on line 1247.
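Roughly, the guard amounts to the following (a paraphrase, not the exact code; find_cluster_instances is a hypothetical stand-in for the script's real lookup):

```python
import sys

def find_cluster_instances(conn, cluster_name):
    # Hypothetical stand-in for the script's actual lookup of the cluster's
    # master and slave instances (e.g. by security group names).
    return [], []

def get_existing_cluster(conn, opts, cluster_name, die_on_error=True):
    # Paraphrase of the safeguard: exit before anything indexes into an
    # empty master_nodes list.
    master_nodes, slave_nodes = find_cluster_instances(conn, cluster_name)
    if not master_nodes and die_on_error:
        print("ERROR: Could not find a master for cluster " + cluster_name)
        sys.exit(1)
    return master_nodes, slave_nodes
```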
Per my comments earlier, this change does not remove existing functionality. It is currently not possible to change instance types when starting a stopped cluster.
On Saturday, March 7, 2015 at 12:08 PM, Theodore Vasiloudis ([email protected]) wrote:

In ec2/spark_ec2.py, #4916 (comment):

@@ -1259,6 +1259,15 @@ def real_main():
             cluster_instances=(master_nodes + slave_nodes),
             cluster_state='ssh-ready'
         )
+
+        # Determine types of running instances
+        existing_master_type = master_nodes[0].instance_type

Yup, this is handled in get_existing_cluster
https://github.com/thvasilo/spark/blob/SPARK-6188%5D-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments/ec2/spark_ec2.py#L617
@srowen This change doesn't remove any existing functionality as far as I can see. As @nchammas said, if you stopped instances of a certain type you can only get the same type back. Right now we implicitly assume users will pass the same flag, and this handles the use case where the user doesn't pass the right flag. I had a minor comment inline, but otherwise this change looks good.
I'll wait another day in case the consensus is that this line should indeed change, but it sounds like it is likely fine as is.
@srowen I'm fine with the line either way and it shouldn't really matter. As I described above, the problem of master_instance_type being different from the slaves' type is a separate issue not related to this PR.
Test build #28387 has started for PR 4916 at commit
Test build #28387 has finished for PR 4916 at commit
Test PASSed.
As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838.
When re-starting a cluster, if the user does not provide the instance types, which is currently the recommended behavior in the docs, the instances will be assigned the default type, m1.large. This then affects the setup of the machines.
This PR solves this by getting the instance types from the existing instances and overwriting the default options.
EDIT: Further clarification of the issue:
In short, while the instances themselves are the same as launched, their setup is done assuming the default instance type, m1.large.
This means that the machines are assumed to have 2 disks, which leads to the problems described in issue 5838, where machines that have one disk end up with shuffle spills in the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors.
Other instance-specific settings that are set in the spark_ec2.py script are likely to be wrong as well.
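A minimal sketch of that approach (the helper name apply_existing_instance_types is hypothetical; master_nodes/slave_nodes are the boto instance lists returned by get_existing_cluster, and opts holds the parsed command-line options; this is illustrative rather than the exact patch):

```python
def apply_existing_instance_types(master_nodes, slave_nodes, opts):
    # Illustrative sketch (not the exact patch): when starting a stopped
    # cluster, take the instance types from the instances that already exist
    # instead of trusting the command-line defaults.
    existing_master_type = master_nodes[0].instance_type
    existing_slave_type = slave_nodes[0].instance_type
    if opts.instance_type != existing_slave_type:
        print("Overriding the default/requested instance type with the "
              "existing one: master=%s, slaves=%s"
              % (existing_master_type, existing_slave_type))
    opts.master_instance_type = existing_master_type
    opts.instance_type = existing_slave_type
```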