[SPARK-28467][CORE][TEST] Increase timeout to up executors for tests #25227
We ran the unit tests on an arm64 instance, and several tests failed because the executors could not come up within the 10000 ms timeout. After increasing the timeout, the tests passed. This fix follows the approach taken in [SPARK-7989] and [SPARK-10651].
Can one of the admins verify this patch?
dongjoon-hyun
left a comment
First of all, please describe your environment clearly in the PR description. I don't think all arm64 machines are slow.
Second, do you mean that all the other tests pass after this PR in your environment? If not, please include all failures in this PR.
dongjoon-hyun
left a comment
Unfortunately, there is no failure on EC2 a1.4xlarge. Could you check once more on a more powerful machine? The test case is not designed for a tiny machine. It should be at least a desktop-level machine.
[ec2-user@ip-172-31-59-187 spark]$ uname -a
Linux ip-172-31-59-187.us-west-2.compute.internal 4.14.123-111.109.amzn2.aarch64 #1 SMP Mon Jun 10 19:34:32 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
[ec2-user@ip-172-31-59-187 spark]$ build/sbt "core/testOnly *.SparkContextSuite"
[info] SparkContextSuite:
...
[info] Tests: succeeded 41, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 41, Failed 0, Errors 0, Passed 41
[success] Total time: 399 s, completed Jul 22, 2019 8:10:07 AM
@dongjoon-hyun Thanks. We use an ARM instance on the vexxhost cloud to run the tests; the instance flavor is 8C8G. We ran the tests several times (on a newly created instance each time), and the executors (2 are required in the test) could not come up within 10000 ms. The two tests mentioned in [SPARK-28467] always fail due to the timeout (currently these are the only two tests that fail for this reason). This timeout has caused problems before; see [SPARK-7989] and [SPARK-10651]. I couldn't find the rationale behind the timeout value. We set it to 20000 because we found that it takes about 13000 ms for the second executor (2 are required in the test) to come up on our ARM test instances. I don't know what value is appropriate, but I think the timeout should not be a blocker for the tests, right? Increasing the timeout won't increase the test time in the normal case; it just fixes the tests that fail due to the timeout. Or is there any other suggestion?
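The claim that a larger timeout does not lengthen the normal case can be illustrated with a minimal polling sketch. This is not Spark's test code: `wait_until` and `ready_after` are hypothetical helpers, and the delays are scaled down (50 ms and 400 ms stand in for a fast and a slow machine, 200 ms and 2000 ms for the old and new limits) so the example runs quickly.

```python
import time


def wait_until(condition, timeout_ms, poll_interval_ms=10):
    """Poll `condition` until it is true or `timeout_ms` elapses.

    Returns the elapsed time in milliseconds on success and raises
    TimeoutError otherwise. Hypothetical helper, not Spark's code.
    """
    start = time.monotonic()
    deadline = start + timeout_ms / 1000.0
    while True:
        if condition():
            return (time.monotonic() - start) * 1000.0
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Condition not met before {timeout_ms} milliseconds elapsed")
        time.sleep(poll_interval_ms / 1000.0)


def ready_after(delay_ms):
    """Return a condition that becomes true `delay_ms` after creation,
    standing in for an executor that needs that long to come up."""
    ready_at = time.monotonic() + delay_ms / 1000.0
    return lambda: time.monotonic() >= ready_at


# Fast machine: the wait ends shortly after the 50 ms startup delay,
# whether the limit is 200 ms or 2000 ms.
fast_short = wait_until(ready_after(50), timeout_ms=200)
fast_long = wait_until(ready_after(50), timeout_ms=2000)

# Slow machine: startup takes 400 ms, so only the larger limit succeeds.
try:
    wait_until(ready_after(400), timeout_ms=200)
    slow_short_timed_out = False
except TimeoutError:
    slow_short_timed_out = True
slow_long = wait_until(ready_after(400), timeout_ms=2000)
```

In this sketch the timeout only bounds the failure path: the successful waits return as soon as the condition holds, so raising the limit changes the outcome on the slow machine without adding time on the fast one.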
@huangtianhua. You need to update the PR description instead of commenting here.
Is that publicly available? I cannot find that one from the following. Could you give me a pointer to it?
@dongjoon-hyun Sorry, it's not public. vexxhost donated the resources to OpenLab (https://openlabtesting.org), a community that does testing for open source projects.
@dongjoon-hyun Have you noticed the email [Ask for ARM CI for spark] sent to [email protected]? Maybe you are interested now :)
I'm here because I read that email. :) However, that cannot be a reason to accept this PR. Since this is not a general issue for BTW, please don't forget my previous comments. You didn't update this PR according to my advice at all. PMC members can override my opinion.
Although this change isn't so bad, as it won't cause faster machines to test more slowly, I'm also hesitant, as this just seems to be caused by using underpowered test machines. We wouldn't raise this to help Spark tests on, say, a small 4-core machine. If a1.4xlarge or equivalent works, then let's suggest that this is what the tests need.
@srowen Our ARM test instance is 8C (8 cores) 8G.
Right now I don't think we're going to run tests in Amplab. I don't think we want the project to commit to testing and fixing ARM issues, not until they are at least all resolved once.
@huangtianhua. This looks like a memory issue,
OK, I will test on a larger instance and close this. Thank you all.
Thank you so much for your understanding, @huangtianhua.
[SPARK-28467][CORE] Increase timeout to up executors for tests
We use an ARM instance on the vexxhost cloud to run the tests; the instance flavor is 8C8G. We ran the tests several times (on a newly created instance each time), and the executors (2 are required in the test) could not come up within 10000 ms. The two tests mentioned in [SPARK-28467] always fail due to the timeout (currently these are the only two tests that fail for this reason):
test driver discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......
test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
......
The environment is: Linux ubuntu 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:52 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
This timeout has caused problems before; see [SPARK-7989] and [SPARK-10651]. I couldn't find the rationale behind the timeout value. We set it to 20000 because we found that it takes about 13000 ms for the second executor (2 are required in the test) to come up on our ARM test instances.
This fix follows the approach taken in [SPARK-7989] and [SPARK-10651].