[EC2] [SPARK-6188] Instance types can be mislabeled when re-starting cluster with default arguments #4916
Conversation
Can one of the admins verify this patch?
ec2/spark_ec2.py (Outdated)
Does this not force the instance types to match the existing ones? That's not inherently required, right? I might be misunderstanding.
You are correct. Perhaps giving the user the option to change the instance type is a good middle ground. But misrepresenting the instance type by falling back to the default value can, as we saw, cause problems without the user's knowledge.
It makes some sense for the default to be 'same instance type', yes, but without forcing it to be the same one. It could emit an info message if they differ.
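For illustration, a minimal sketch of that behavior might look like the following (the function and variable names are placeholders, not the script's actual ones):

```python
# Minimal illustrative sketch (placeholder names, not spark_ec2.py's actual ones):
# default to the existing type, allow an explicit override, and log a message
# when the requested and existing types differ.
def choose_instance_type(requested_type, existing_type, explicitly_set):
    """Pick the instance type to use when restarting a stopped cluster."""
    if not explicitly_set:
        return existing_type  # default: keep whatever is already running
    if requested_type != existing_type:
        print("INFO: requested instance type %s differs from the existing "
              "instances' type %s" % (requested_type, existing_type))
    return requested_type
```

That keeps the default safe while leaving the final say with the user.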
OK, so the use case would be:
1. The user starts a cluster with some instance type, and later stops it.
2. The user now wants to restart the cluster.
3. We can either:
   i. just restart the same instances, or
   ii. allow the user to change the instance types by redefining them in the options passed with the start command.

Is 3.ii a viable use case? Is it possible to change the type of the instance being re-launched, for example in the case where the cluster also had persistent storage, i.e. extra EBS volumes that are still attached to the stopped instances? (See the sketch below.)
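For reference, EC2 itself does allow changing the type of a stopped, EBS-backed instance, and attached EBS volumes stay attached across the change (instance-store disks do not carry over). A rough sketch with boto 2.x, using a placeholder region, instance id, and target type:

```python
# Rough sketch (boto 2.x): change the type of a *stopped*, EBS-backed instance.
# Region, instance id, and target type below are placeholders.
import time
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
instance_id = "i-12345678"
instance = conn.get_all_instances(instance_ids=[instance_id])[0].instances[0]

conn.stop_instances(instance_ids=[instance_id])
while instance.update() != "stopped":
    time.sleep(10)  # wait until EC2 reports the instance as stopped

# Attached EBS volumes remain attached; only the instance type changes.
conn.modify_instance_attribute(instance_id, "instanceType", "m3.xlarge")
conn.start_instances(instance_ids=[instance_id])
```

Whether spark_ec2.py should expose that is a separate question from this PR.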
Paging @nchammas and/or @shivaram for review, as I am really not the best person to ask, but I had expected that you could change the instance type on restart. It may not work for all possible combinations, but it seems like something this script doesn't have to police -- though it also should not prohibit it.
ok to test
Test build #28339 has started for PR 4916 at commit
Test build #28339 has finished for PR 4916 at commit
Test FAILed.
Removed trailing whitespace, was that the style problem?
You can click through to the result of the Jenkins build to see the error. You can also run the checks locally to double-check your work.
Test build #28341 has started for PR 4916 at commit
Test build #28341 has finished for PR 4916 at commit
Test FAILed.
Test build #28342 has started for PR 4916 at commit
@thvasilo can you clarify what flow you are seeing this behavior for? I can't reproduce the issue you're reporting. Here's what I just tried on
My cluster came back up with the instance types I originally specified, which is the expected behavior.
I also tried to change the instance type when starting a stopped cluster. That didn't work, and the instances were brought up with their original instance types. I think that is potentially something we could improve.
Test build #28342 has finished for PR 4916 at commit
Test PASSed.
@nchammas The instances do come up with the correct type; the problem is more the one described here. In short, while the instances themselves are OK, their setup is done assuming the default instance type, m1.large. This means that the machines are assumed to have 2 disks, which leads to the problems described in issue 5838, where machines that have one disk end up with shuffle spills in the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors. I will update issue 6188 to include this information.
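To make that concrete, the setup derives the number of local scratch disks from the instance type, along the lines of this simplified sketch (the table and values here are illustrative, not a copy of the script's):

```python
# Simplified, illustrative lookup of local/ephemeral disks per instance type.
# (Example values only, not a copy of spark_ec2.py's table.)
DISKS_BY_INSTANCE = {
    "m1.large": 2,  # the default assumed by the script: two ephemeral disks
    "m3.large": 1,
    "r3.large": 1,
}

def get_num_disks(instance_type):
    # Unknown types fall back to a single disk rather than failing outright.
    return DISKS_BY_INSTANCE.get(instance_type, 1)

# If the cluster was launched with a one-disk type but restarted with the
# m1.large default in the options, setup expects two scratch disks and shuffle
# data ends up spilling to the small (8GB) snapshot partition instead.
```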
Ah, I see. Thank you for the clarification. I would add this to the PR body, since it becomes part of the commit message if this PR gets merged.
I understand the goal to be:
This change seems to always force the new instance type to match the old instance type, though. Is the argument that this should not be allowed any longer? Or am I mistaken and it was never allowed? In any event, whatever the instance type is, it should of course be configured correctly for that instance type!
I am also not sure about this. It would be cool to allow the user to change the instance type if that is possible; otherwise, just relaunching the same cluster with the same instance types seems like a good compromise.
I guess I'm trying to figure out whether this change subtracts existing functionality. That would be potentially controversial and is what I would hesitate over. If it merely doesn't add this functionality, that seems fine to me. I'm not necessarily arguing that changing the instance type should be possible, either.
Dumb question here too: does this assume there exists a running master node? It references the first running master node, but what if there aren't any? Or is that handled?
Yup, this is handled in get_existing_cluster, which is called on line 1247.
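Roughly, the guard amounts to the following (a paraphrase, not the exact code; find_cluster_instances is a hypothetical stand-in for the script's real lookup):

```python
import sys

def find_cluster_instances(conn, cluster_name):
    # Hypothetical stand-in for the script's actual lookup of the cluster's
    # master and slave instances (e.g. by security group names).
    return [], []

def get_existing_cluster(conn, opts, cluster_name, die_on_error=True):
    # Paraphrase of the safeguard: exit before anything indexes into an
    # empty master_nodes list.
    master_nodes, slave_nodes = find_cluster_instances(conn, cluster_name)
    if not master_nodes and die_on_error:
        print("ERROR: Could not find a master for cluster " + cluster_name)
        sys.exit(1)
    return master_nodes, slave_nodes
```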
Per my comments earlier, this change does not remove existing functionality. It is currently not possible to change instance types when starting a stopped cluster.
On Saturday, March 7, 2015 at 12:08 PM, Theodore Vasiloudis ([email protected]) wrote:

In ec2/spark_ec2.py, #4916 (comment):

@@ -1259,6 +1259,15 @@ def real_main():
             cluster_instances=(master_nodes + slave_nodes),
             cluster_state='ssh-ready'
         )
+
+        # Determine types of running instances
+        existing_master_type = master_nodes[0].instance_type

Yup, this is handled in get_existing_cluster
https://github.com/thvasilo/spark/blob/SPARK-6188%5D-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments/ec2/spark_ec2.py#L617
@srowen This change doesn't remove any existing functionality as far as I can see. As @nchammas said, if you stopped instances of a certain type you can only get the same type back. Right now we implicitly assume users will pass the same flag, and this handles the use case where the user doesn't pass the right flag. I had a minor comment inline, but otherwise this change looks good.
I'll wait another day in case the consensus is that this line should indeed change, but it sounds like it is likely fine as is.
@srowen I'm fine with the line either way and it shouldn't really matter. As I described above, the problem of master_instance_type being different from the slaves' type is a separate issue not related to this PR.
Test build #28387 has started for PR 4916 at commit
Test build #28387 has finished for PR 4916 at commit
Test PASSed.
As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838.
When re-starting a cluster, if the user does not provide the instance types, which is currently the recommended behavior in the docs, the instances will be assigned the default type, m1.large. This then affects the setup of the machines.
This PR solves this by getting the instance types from the existing instances and overwriting the default options.
EDIT: Further clarification of the issue:
In short, while the instances themselves are the same as launched, their setup is done assuming the default instance type, m1.large.
This means that the machines are assumed to have 2 disks, which leads to the problems described in issue 5838, where machines that have one disk end up with shuffle spills in the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors.
Other instance-specific settings that are set in the spark_ec2.py script are likely to be wrong as well.
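A minimal sketch of that approach (the helper name apply_existing_instance_types is hypothetical; master_nodes/slave_nodes are the boto instance lists returned by get_existing_cluster, and opts holds the parsed command-line options; this is illustrative rather than the exact patch):

```python
def apply_existing_instance_types(master_nodes, slave_nodes, opts):
    # Illustrative sketch (not the exact patch): when starting a stopped
    # cluster, take the instance types from the instances that already exist
    # instead of trusting the command-line defaults.
    existing_master_type = master_nodes[0].instance_type
    existing_slave_type = slave_nodes[0].instance_type
    if opts.instance_type != existing_slave_type:
        print("Overriding the default/requested instance type with the "
              "existing one: master=%s, slaves=%s"
              % (existing_master_type, existing_slave_type))
    opts.master_instance_type = existing_master_type
    opts.instance_type = existing_slave_type
```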