@huangtianhua
Contributor

@huangtianhua huangtianhua commented Jul 22, 2019

[SPARK-28467][CORE] Increase timeout to up executors for tests

We run the tests on an ARM instance from the vexxhost cloud; the instance flavor is 8C8G. We ran the tests several times (on a newly created instance each time), and the executors (2 are required in the test) could not come up within 10000 ms. The two tests mentioned in [SPARK-28467] always fail due to this timeout (currently these are the only two tests that fail for this reason):
test driver discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......
test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......

The environment is: Linux ubuntu 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:52 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
This timeout has caused problems before; see [SPARK-7989] and [SPARK-10651]. I couldn't find the rationale behind the timeout value. We set it to 20000 because, on our ARM test instances, it takes about 13000 ms for the second executor (2 are required in the test) to come up.

This fix follows the solution used for [SPARK-7989] and [SPARK-10651].

We ran the unit tests on an arm64 instance, and
several tests failed because the executors could not come up
within the 10000 ms timeout. After increasing the timeout,
the tests passed.

@AmplabJenkins

Can one of the admins verify this patch?

Member

@dongjoon-hyun dongjoon-hyun left a comment

First of all, please describe your environment clearly in the PR description. I don't think all arm64 machines are slow.
Second, do you mean that all the other tests pass after this PR in your environment? If not, please include all the failures in this PR.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Unfortunately, there is no failure on EC2 a1.4xlarge. Could you check once more on a more powerful machine? The test case is not designed for a tiny machine. It should be at least a desktop-level machine.

[ec2-user@ip-172-31-59-187 spark]$ uname -a
Linux ip-172-31-59-187.us-west-2.compute.internal 4.14.123-111.109.amzn2.aarch64 #1 SMP Mon Jun 10 19:34:32 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

[ec2-user@ip-172-31-59-187 spark]$ build/sbt "core/testOnly *.SparkContextSuite"
[info] SparkContextSuite:
...
[info] Tests: succeeded 41, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 41, Failed 0, Errors 0, Passed 41
[success] Total time: 399 s, completed Jul 22, 2019 8:10:07 AM

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28467][CORE] Increase timeout to up executors for tests [SPARK-28467][CORE][TEST] Increase timeout to up executors for tests Jul 22, 2019
@huangtianhua
Contributor Author

@dongjoon-hyun Thanks.
Sorry, here is the environment:
zuul@ubuntu:/src/github.com/theopenlab/spark$ uname -a
Linux ubuntu 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:52 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

We run the tests on an ARM instance from the vexxhost cloud; the instance flavor is 8C8G. We ran the tests several times (on a newly created instance each time), and the executors (2 are required in the test) could not come up within 10000 ms. The two tests mentioned in [SPARK-28467] always fail due to this timeout (currently these are the only two tests that fail for this reason):
test driver discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......
test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......

This timeout has caused problems before; see [SPARK-7989] and [SPARK-10651]. I couldn't find the rationale behind the timeout value. We set it to 20000 because, on our ARM test instances, it takes about 13000 ms for the second executor (2 are required in the test) to come up. I don't know what value is appropriate. But I think the timeout should not be a blocker for the tests, right? Increasing the timeout won't increase the test time in the normal case; it only fixes the tests that fail due to the timeout. Or is there any other suggestion?

@dongjoon-hyun
Member

@huangtianhua . You need to update the PR description instead of commenting here.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 22, 2019

Is that available in public? I cannot find that one from the following. Could you give me a pointer for that?

@huangtianhua
Contributor Author

@dongjoon-hyun Sorry, it's not public. vexxhost donated the resources to OpenLab (https://openlabtesting.org), a community that tests open source projects.

@huangtianhua
Contributor Author

@dongjoon-hyun Have you noticed the email [Ask for ARM CI for spark] in [email protected]? Maybe you are interested now :)

@dongjoon-hyun
Member

I'm here because I read that email. :)

However, that cannot be a reason to accept this PR. Since this is not a general issue for aarch64, I'm reluctant to accept this kind of assumption. In general, EC2 is a de facto standard infrastructure that is more easily accessible to most users. If we need aarch64 support, I'd like to recommend that our community use a1.4xlarge as the standard instance for release testing and benchmarking.

BTW, please don't forget my previous comments. You didn't update this PR according to my advice at all. PMC members can override my opinion.

cc @srowen , @rxin

@srowen
Member

srowen commented Jul 22, 2019

Although this change isn't so bad, as it won't cause faster machines to test more slowly, I'm also hesitant, as this just seems to be caused by using underpowered test machines. We wouldn't raise this to help Spark tests on, say, a small 4-core machine. If a1.4xlarge or equivalent works, then let's suggest that this is what the tests need.
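
To illustrate why a larger timeout does not slow down fast machines, here is a minimal sketch of a poll-until-deadline wait. It is not Spark's implementation: `wait_until` and `ready_after` are hypothetical helpers, and the millisecond values are made up for the example. The wait returns as soon as the condition holds, so the timeout only bounds the worst case.

```python
import time


def wait_until(condition, timeout_ms, poll_ms=10):
    """Poll `condition` until it is true or `timeout_ms` elapses.

    Returns the elapsed time in milliseconds on success and raises
    TimeoutError otherwise. Hypothetical helper, not a Spark API.
    """
    start = time.monotonic()
    deadline = start + timeout_ms / 1000.0
    while True:
        if condition():
            # Success: return immediately, regardless of how large the timeout is.
            return (time.monotonic() - start) * 1000.0
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"condition not met before {timeout_ms} milliseconds elapsed"
            )
        time.sleep(poll_ms / 1000.0)


def ready_after(delay_ms):
    """Hypothetical stand-in for 'executors are up': true once delay_ms has passed."""
    ready_at = time.monotonic() + delay_ms / 1000.0
    return lambda: time.monotonic() >= ready_at


if __name__ == "__main__":
    # Same simulated startup delay, two different timeouts: both waits
    # finish shortly after the condition becomes true.
    print(wait_until(ready_after(50), timeout_ms=200))
    print(wait_until(ready_after(50), timeout_ms=2000))
```

Under these assumptions, doubling the timeout changes only how long a failing run waits before giving up; a run whose condition becomes true early takes the same time either way.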

@huangtianhua
Contributor Author

@srowen Our ARM test instance is 8C (8 cores) 8G.
I cannot find the flavor of the machines used for Jenkins testing at https://amplab.cs.berkeley.edu/jenkins/computer. Could someone tell me? Thank you very much.

@srowen
Member

srowen commented Jul 23, 2019

Right now I don't think we're going to run tests in Amplab. I don't think we want the project to commit to testing and fixing ARM issues, not until they are at least all resolved once.

@dongjoon-hyun
Member

@huangtianhua . This looks like a memory issue (8GB is too small) rather than an ARM64 issue.
If this is a memory issue, I'd like to recommend that you close this issue.

@huangtianhua
Contributor Author

OK, I will test on a larger instance and close this. Thank you all.

@dongjoon-hyun
Member

Thank you so much for your understanding, @huangtianhua .

@huangtianhua huangtianhua deleted the increase-executor-up-timeout branch September 11, 2019 03:26