[SPARK-28467][CORE][TEST] Increase timeout to up executors for tests #25227

huangtianhua · 2019-07-22T06:40:39Z

[SPARK-28467][CORE] Increase timeout to up executors for tests

We run the tests on an ARM instance from the vexxhost cloud; the instance flavor is 8C8G (8 cores, 8 GB). We ran the tests several times (on a newly created instance each time), and the executors (2 required in the test) could not come up within 10000 ms. The two tests mentioned in [SPARK-28467] always fail due to this timeout (currently only these two tests fail for this reason):
test driver discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......
test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......

And the environment is: Linux ubuntu 4.15.0-46-generic #4916.04.1-Ubuntu SMP Tue Feb 12 17:45:52 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
The timeout has been a problem before; see [SPARK-7989] and [SPARK-10651]. I could not find the rationale for the timeout value. We set it to 20000 because, on our ARM test instances, the second executor (2 required in the test) takes about 13000 ms to come up.

This fix follows the approach of [SPARK-7989] and [SPARK-10651].

We ran the unit tests on an arm64 instance, and several tests failed because the executors could not come up within the 10000 ms timeout. After increasing the timeout, the tests passed.
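
To illustrate the timing argument, here is a minimal sketch of a poll-until-ready wait. It is not Spark's test helper: `wait_until`, `FakeClock`, and `simulate` are hypothetical names, the 100 ms poll interval and the 3000 ms "fast machine" startup time are made-up values, and only the 10000, 13000, and 20000 ms figures come from the description above. A fake clock is used so the simulation runs instantly.

```python
import time


def wait_until(condition, timeout_ms, poll_ms=10,
               now_ms=lambda: time.monotonic() * 1000.0,
               sleep_ms=lambda ms: time.sleep(ms / 1000.0)):
    """Poll `condition` until it is true; return the elapsed milliseconds.

    Raises TimeoutError once `timeout_ms` has elapsed without success.
    The clock and sleep functions are injectable so the wait can be simulated.
    """
    start = now_ms()
    while not condition():
        if now_ms() - start >= timeout_ms:
            raise TimeoutError(
                f"condition not met before {timeout_ms} milliseconds elapsed")
        sleep_ms(poll_ms)
    return now_ms() - start


class FakeClock:
    """Integer-millisecond clock that only advances when sleep() is called."""

    def __init__(self):
        self.now = 0

    def time(self):
        return self.now

    def sleep(self, ms):
        self.now += ms


def simulate(ready_after_ms, timeout_ms):
    """Simulate executors that become ready `ready_after_ms` after start."""
    clock = FakeClock()
    return wait_until(lambda: clock.time() >= ready_after_ms, timeout_ms,
                      poll_ms=100, now_ms=clock.time, sleep_ms=clock.sleep)


if __name__ == "__main__":
    # Hypothetical fast machine: the wait ends at 3000 ms under either timeout.
    print(simulate(3000, 10000), simulate(3000, 20000))
    # Slow instance from the description: about 13000 ms to come up.
    print(simulate(13000, 20000))
    try:
        simulate(13000, 10000)
    except TimeoutError as err:
        print("timed out:", err)
```

In this sketch the wait returns as soon as the condition holds, so the larger timeout only changes the outcome when startup takes longer than the smaller one.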

AmplabJenkins · 2019-07-22T06:48:30Z

Can one of the admins verify this patch?

dongjoon-hyun

First of all, please be clear in the PR description about your environment. I don't think all arm64 machines are slow.
Second, do you mean that all the other tests pass after this PR in your environment? Otherwise, please include all failures in this PR.

dongjoon-hyun

Unfortunately, there is no failure on EC2 a1.4xlarge. Could you check once more on a more powerful machine? The test case is not designed for a tiny machine. At least, it should be a desktop-level machine.

[ec2-user@ip-172-31-59-187 spark]$ uname -a
Linux ip-172-31-59-187.us-west-2.compute.internal 4.14.123-111.109.amzn2.aarch64 #1 SMP Mon Jun 10 19:34:32 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

[ec2-user@ip-172-31-59-187 spark]$ build/sbt "core/testOnly *.SparkContextSuite"
[info] SparkContextSuite:
...
[info] Tests: succeeded 41, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 41, Failed 0, Errors 0, Passed 41
[success] Total time: 399 s, completed Jul 22, 2019 8:10:07 AM

huangtianhua · 2019-07-22T08:35:48Z

@dongjoon-hyun Thanks.
Sorry, I will add the environment:
zuul@ubuntu:/src/github.com/theopenlab/spark$ uname -a
Linux ubuntu 4.15.0-46-generic #4916.04.1-Ubuntu SMP Tue Feb 12 17:45:52 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

We run the tests on an ARM instance from the vexxhost cloud; the instance flavor is 8C8G (8 cores, 8 GB). We ran the tests several times (on a newly created instance each time), and the executors (2 required in the test) could not come up within 10000 ms. The two tests mentioned in [SPARK-28467] always fail due to this timeout (currently only these two tests fail for this reason):
test driver discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......
test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......

The timeout has been a problem before; see [SPARK-7989] and [SPARK-10651]. I could not find the rationale for the timeout value. We set it to 20000 because, on our ARM test instances, the second executor (2 required in the test) takes about 13000 ms to come up. I don't know what value is appropriate, but I think the timeout should not be a blocker for the tests, right? Increasing the timeout won't increase the test time in the normal case; it just fixes the tests that fail due to the timeout. Or is there any other suggestion?

dongjoon-hyun · 2019-07-22T08:41:50Z

@huangtianhua . You need to update the PR description instead of commenting here.

dongjoon-hyun · 2019-07-22T08:50:29Z

Is that publicly available? I cannot find it on the following page. Could you give me a pointer?

https://vexxhost.com/pricing/

huangtianhua · 2019-07-22T08:57:28Z

@dongjoon-hyun Sorry, it's not public. vexxhost donated the resources to OpenLab (https://openlabtesting.org), a community that tests open source projects.

huangtianhua · 2019-07-22T09:00:14Z

@dongjoon-hyun Have you noticed the email [Ask for ARM CI for spark] in [email protected]? Maybe you are interested now :)

dongjoon-hyun · 2019-07-22T14:57:35Z

I'm here because I read that email. :)

However, that cannot be a reason to accept this PR. Since this is not a general issue for aarch64, I'm reluctant to accept this kind of assumption. In general, EC2 is a de facto standard infrastructure that is more easily accessible to most users. If we need aarch64 support, I'd like to recommend that our community use a1.4xlarge as a standard instance for release testing and benchmarking.

BTW, please don't forget my previous comments. You didn't update this PR according to my advice at all. PMC members can override my opinion.

cc @srowen , @rxin

srowen · 2019-07-22T15:28:49Z

Although this change isn't so bad, as it won't cause faster machines to test more slowly, I'm also hesitant, as this just seems to be caused by using underpowered test machines. We wouldn't raise this to help Spark tests on, say, a small 4-core machine. If a1.4xlarge or equivalent works, then let's suggest that this is what the tests need.

huangtianhua · 2019-07-23T08:54:25Z

@srowen Our ARM test instance is 8C8G (8 cores, 8 GB).
I cannot find the flavor of the machines used for Jenkins testing at https://amplab.cs.berkeley.edu/jenkins/computer. Could someone tell me? Thank you very much.

srowen · 2019-07-23T12:00:15Z

Right now I don't think we're going to run tests in Amplab. I don't think we want the project to commit to testing and fixing ARM issues, not until they are at least all resolved once.

dongjoon-hyun · 2019-07-23T23:45:16Z

@huangtianhua . This looks like a memory issue (8GB is too small) rather than an ARM64 issue.
If this is a memory issue, I'd recommend that you close this PR.

huangtianhua · 2019-07-24T02:35:05Z

OK, I will test on a larger instance and close this. Thank you all.

dongjoon-hyun · 2019-07-24T02:42:48Z

Thank you so much for your understanding, @huangtianhua .

dongjoon-hyun requested changes Jul 22, 2019

dongjoon-hyun added the SPARK CORE label Jul 22, 2019

dongjoon-hyun requested changes Jul 22, 2019

dongjoon-hyun changed the title ~~[SPARK-28467][CORE] Increase timeout to up executors for tests~~ [SPARK-28467][CORE][TEST] Increase timeout to up executors for tests Jul 22, 2019

dongjoon-hyun added the TESTS label Jul 22, 2019

huangtianhua closed this Jul 24, 2019

huangtianhua deleted the increase-executor-up-timeout branch September 11, 2019 03:26
