Conversation

@thvasilo
Contributor

@thvasilo thvasilo commented Mar 5, 2015

As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838.

When restarting a cluster, if the user does not provide the instance types, which is currently the behavior recommended in the docs, the instances will be assigned the default type m1.large. This then affects how the machines are set up.

This PR solves the problem by reading the instance types from the existing instances and overwriting the default options.

EDIT: Further clarification of the issue:

In short, while the instances themselves come back as originally launched, their setup is done assuming the default instance type, m1.large.

This means the machines are assumed to have 2 disks, which leads to the problems described in issue 5838: machines that have only one disk end up writing shuffle spills to the small (8GB) snapshot partition, which quickly fills up and causes jobs to fail with "No space left on device" errors.

Other instance-specific settings made by the spark_ec2.py script are likely to be wrong as well.
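
For context, a rough sketch of the approach (not the exact patch; it assumes boto instance objects and the opts object built by spark_ec2.py's option parser, with its --instance-type and --master-instance-type flags):

# Rough sketch only (not the exact patch); assumes boto instance objects and
# the opts object produced by spark_ec2.py's option parser.
def resolve_restart_instance_types(opts, master_nodes, slave_nodes):
    # Read the types back from the existing (stopped) instances.
    existing_master_type = master_nodes[0].instance_type
    existing_slave_type = slave_nodes[0].instance_type

    # Overwrite the (possibly default, m1.large) options so that later setup
    # steps configure disks and memory for the real instance types.
    if (opts.master_instance_type != existing_master_type
            or opts.instance_type != existing_slave_type):
        print("Overriding requested instance types with those of the "
              "existing instances.")
    opts.master_instance_type = existing_master_type
    opts.instance_type = existing_slave_type
    return opts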

@AmplabJenkins

Can one of the admins verify this patch?

ec2/spark_ec2.py Outdated
Member

Does this not force the instance types to match the existing ones? That's not inherently required, right? I might misunderstand.

Contributor Author

You are correct. Perhaps giving the user the option to change the instance type is a good middle ground. But misrepresenting the instance type by falling back to the default value can, as we saw, cause problems without the user's knowledge.

Member

It makes sense for the default to be 'same instance type', yes, but not to force it to be the same one. It could emit an info message if they differ.
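
(Purely to illustrate the suggestion above -- a hypothetical sketch, not code from this PR:)

# Hypothetical: keep whatever type the user asked for, but log when it
# differs from the type of the existing (stopped) instances.
def check_requested_type(requested_type, existing_type):
    if requested_type != existing_type:
        print("INFO: requested instance type %s differs from the existing "
              "instances' type %s" % (requested_type, existing_type))
    return requested_type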

Contributor Author

OK, so the use case would be:

  1. The user starts a cluster with some instance type, and later stops it.
  2. The user now wants to restart the cluster.
  3. We can either:
    i. Just restart the same instances.
    ii. Or allow the user to change the instance types by redefining them in the options passed with the start command.

Is 3.ii a viable use case? Is it possible to change the type of an instance being re-launched, for example when the cluster also has persistent storage, i.e. extra EBS volumes still attached to the stopped instances?

Member

Paging @nchammas and/or @shivaram for review, as I am really not the best person to ask, but I had expected that you could change the instance type on restart. It may not work for all possible combinations, but it seemed like something this script doesn't have to police -- but also should not prohibit.

@srowen
Member

srowen commented Mar 6, 2015

ok to test

@SparkQA

SparkQA commented Mar 6, 2015

Test build #28339 has started for PR 4916 at commit 3ebd52a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 6, 2015

Test build #28339 has finished for PR 4916 at commit 3ebd52a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28339/
Test FAILed.

@thvasilo
Contributor Author

thvasilo commented Mar 6, 2015

Removed the trailing whitespace; was that the style problem?

@srowen
Member

srowen commented Mar 6, 2015

You can click through to the result of the Jenkins build to see the error:

PEP 8 checks failed.
./ec2/spark_ec2.py:1262:1: W293 blank line contains whitespace
./ec2/spark_ec2.py:1270:1: W293 blank line contains whitespace

You can run the checks locally to double-check your work with ./dev/lint-python.

@SparkQA

SparkQA commented Mar 6, 2015

Test build #28341 has started for PR 4916 at commit 7b32429.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 6, 2015

Test build #28341 has finished for PR 4916 at commit 7b32429.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28341/
Test FAILed.

@SparkQA

SparkQA commented Mar 6, 2015

Test build #28342 has started for PR 4916 at commit a3d29fe.

  • This patch merges cleanly.

@nchammas
Contributor

nchammas commented Mar 6, 2015

@thvasilo can you clarify what flow you are seeing this behavior for? I can't reproduce the issue you're reporting.

Here's what I just tried on master:

  1. Launch a cluster, setting the instance type to m3.large, which is different from the default instance type.

    ./spark-ec2 launch "spark-test" \
    --instance-type m3.large \
    --slaves 1 \
    ...
    
  2. Stop the cluster.

    ./spark-ec2 stop "spark-test" \
    ...
    
  3. Start the cluster without specifying the instance type.

    ./spark-ec2 start "spark-test" \
    ...
    

My cluster came back up with the instance types I originally specified, which is the expected behavior.

@nchammas
Contributor

nchammas commented Mar 6, 2015

I also tried to change the instance type when starting a stopped cluster. That didn't work, and the instances were brought up with their original instance types.

I think that is potentially something we could improve.
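
(Not part of this PR, but for reference: EC2 allows changing a stopped instance's type through the API, so a sketch of such an improvement with boto 2 might look roughly like this -- the helper name is made up.)

# Not part of this PR -- a sketch of the possible improvement. A stopped EC2
# instance's type can be changed through the API before starting it again.
import boto.ec2

def change_instance_types(region, instance_ids, new_type):
    conn = boto.ec2.connect_to_region(region)
    for instance_id in instance_ids:
        # The instance must be in the 'stopped' state for this to succeed.
        conn.modify_instance_attribute(instance_id, "instanceType", new_type)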

@SparkQA

SparkQA commented Mar 6, 2015

Test build #28342 has finished for PR 4916 at commit a3d29fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28342/
Test PASSed.

@thvasilo
Contributor Author

thvasilo commented Mar 6, 2015

@nchammas The instances do come up with the correct type; the problem is more the one described here:

In short, while the instances themselves are OK, their setup is done assuming the default instance type, m1.large.

This means the machines are assumed to have 2 disks, which leads to the problems described in issue 5838: machines that have only one disk end up writing shuffle spills to the small (8GB) snapshot partition, which quickly fills up and causes jobs to fail with "No space left on device" errors.

I will update issue 6188 to include this information.
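
(To illustrate why the wrong type matters: spark_ec2.py derives the number of ephemeral disks from the instance type. The sketch below is illustrative only -- a trimmed stand-in for the script's mapping, with disk counts quoted from memory.)

# Illustrative excerpt only; the real script carries a much larger table
# and the disk counts here are quoted from memory.
DISKS_BY_INSTANCE = {
    "m1.large": 2,   # the option default -- what a restart wrongly assumed
    "m3.large": 1,
}

def get_num_disks(instance_type):
    """Look up how many ephemeral disks to format and mount for this type."""
    return DISKS_BY_INSTANCE[instance_type]

# With the old behavior, a restart without --instance-type looks up
# "m1.large" even on m3.large machines, so setup prepares two disks on a
# host that only has one, and spills end up on the small root partition.
print(get_num_disks("m1.large"))  # 2, regardless of the actual hardware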

@nchammas
Contributor

nchammas commented Mar 6, 2015

Ah, I see. Thank you for the clarification. I would add this to the PR body, since it becomes part of the commit message if this PR gets merged.

@srowen
Member

srowen commented Mar 7, 2015

I understand the goal to be:

  • If I do specify an instance type T on restart, launch instance type T and configure for T, regardless of the previous instance type.
  • If I don't specify an instance type, then launch and configure correctly for the previous instance type.

This change seems to always force the new instance type to match the old instance type, though. Is the argument that this should not be allowed any longer? Or am I mistaken and it was never allowed?

In any event, of course, whatever the instance type is, it should be configured correctly for that instance type!

@thvasilo
Contributor Author

thvasilo commented Mar 7, 2015

Is the argument that this should not be allowed any longer? or am I mistaken and it was never allowed?

I am also not sure about this. It would be cool to allow the user to change the instance type if that is possible; otherwise, just relaunching the same cluster with the same instance types seems like a good compromise.

@srowen
Member

srowen commented Mar 7, 2015

I guess I'm trying to figure out if this change subtracts existing functionality. That's potentially controversial and what I would hesitate over.

If it merely doesn't add this functionality, that seems fine to me. I'm not necessarily arguing that changing instance type should be possible, either.

Member

Dumb question here too: does this assume there exists a running master node? It's referencing the first running master node, but what if there aren't any? Or is that handled?

Contributor Author

Yup, this is handled in get_existing_cluster, which is called at line 1247.
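
(A minimal sketch of the guard being referred to, under the assumption that get_existing_cluster aborts when no master is found -- find_cluster_instances is a made-up stand-in for the script's actual lookup.)

import sys

def find_cluster_instances(conn, cluster_name):
    # Made-up stand-in for the script's lookup of the cluster's instances
    # by their security groups; returns (master_nodes, slave_nodes).
    return [], []

def get_existing_cluster(conn, opts, cluster_name, die_on_error=True):
    master_nodes, slave_nodes = find_cluster_instances(conn, cluster_name)
    if not master_nodes and die_on_error:
        # Abort before anything dereferences master_nodes[0].
        print("ERROR: Could not find a master for cluster " + cluster_name,
              file=sys.stderr)
        sys.exit(1)
    return master_nodes, slave_nodes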

Contributor

Per my comments earlier, this change does not remove existing functionality. It is currently not possible to change instance types when starting a stopped cluster.

On Sat, Mar 7, 2015 at 12:08 PM, Theodore Vasiloudis [email protected] wrote:

In ec2/spark_ec2.py
#4916 (comment):

@@ -1259,6 +1259,15 @@ def real_main():
         cluster_instances=(master_nodes + slave_nodes),
         cluster_state='ssh-ready'
     )
+
+    # Determine types of running instances
+    existing_master_type = master_nodes[0].instance_type

Yup, this is handled in get_existing_cluster
https://github.com/thvasilo/spark/blob/SPARK-6188%5D-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments/ec2/spark_ec2.py#L617



@shivaram
Contributor

shivaram commented Mar 7, 2015

@srowen This change doesn't remove any existing functionality as far as I can see. As @nchammas said, if you stopped instances of a certain type you can only get the same type back. Right now we implicitly assume users will pass the same flag, and this handles the use case where the user doesn't pass the right flag.

I had a minor comment inline, but otherwise this change looks good.

@srowen
Member

srowen commented Mar 8, 2015

I'll wait another day in case the consensus is that this line should indeed change, but it sounds like it is likely fine as is.

@shivaram
Contributor

shivaram commented Mar 8, 2015

@srowen I'm fine with the line either way; it shouldn't really matter. As I described above, the problem of master_instance_type being different from the slaves' type is a separate issue, not related to this PR.

@SparkQA

SparkQA commented Mar 9, 2015

Test build #28387 has started for PR 4916 at commit 6705b98.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 9, 2015

Test build #28387 has finished for PR 4916 at commit 6705b98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28387/
Test PASSed.

@asfgit asfgit closed this in f7c7992 Mar 9, 2015
@thvasilo thvasilo deleted the SPARK-6188]-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments branch March 9, 2015 15:47