@zsxwing
Member

@zsxwing zsxwing commented Sep 1, 2016

What changes were proposed in this pull request?

After digging into the logs, I noticed that the test fails because it starts a local cluster with 2 executors, but the executors may not be up yet when the SparkContext is created. If one of the executors is not up while the job is running, the blocks won't be replicated.

This PR just adds a wait loop before running the job to fix the flaky test.

How was this patch tested?

Jenkins

"""
|val timeout = 60000 // 60 seconds
|val start = System.currentTimeMillis
|while(sc.getExecutorStorageStatus.size != 3 &&
Member Author

3 = 1 driver + 2 executors
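
The diff excerpt above is cut off on this page, so the following is only a minimal sketch of the wait-loop idea, not the code from the patch. The names `WaitForExecutors`, `waitUntil`, and `registeredCount` are hypothetical; in the REPL test the polled condition would be something like `sc.getExecutorStorageStatus.size == 3`.

```scala
object WaitForExecutors {
  /** Polls `condition` until it holds or `timeoutMs` elapses.
    * Returns true if the condition held before the deadline, false otherwise.
    */
  def waitUntil(timeoutMs: Long, pollMs: Long = 100L)(condition: => Boolean): Boolean = {
    val deadline = System.currentTimeMillis + timeoutMs
    var ok = condition
    while (!ok && System.currentTimeMillis < deadline) {
      Thread.sleep(pollMs) // back off between checks instead of spinning
      ok = condition
    }
    ok
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for sc.getExecutorStorageStatus.size:
    // pretends one more block manager has registered on every poll.
    var polls = 0
    def registeredCount(): Int = { polls += 1; math.min(polls, 3) }

    // 3 = 1 driver + 2 executors, as noted in the review comment.
    val ready = waitUntil(timeoutMs = 60000L, pollMs = 10L)(registeredCount() == 3)
    if (!ready) throw new AssertionError("Timeout while waiting for executors")
    println(s"executors ready after $polls polls")
  }
}
```

Returning a Boolean rather than throwing inside the helper lets the caller decide how to report a timeout, which is convenient when the loop is embedded in a test script.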

@zsxwing
Member Author

zsxwing commented Sep 1, 2016

/cc @ericl

@SparkQA

SparkQA commented Sep 1, 2016

Test build #64739 has finished for PR 14905 at commit 6eeb7f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl
Contributor

ericl commented Sep 1, 2016

Btw this is done in DistributedSuite using
sc.jobProgressListener.waitUntilExecutorsUp(2, 30000)

@zsxwing
Member Author

zsxwing commented Sep 1, 2016

sc.jobProgressListener.waitUntilExecutorsUp(2, 30000)

It's not a public API, so I cannot use it in the REPL.

@ericl
Contributor

ericl commented Sep 1, 2016

Ah, too bad then. Lgtm

@zsxwing
Member Author

zsxwing commented Sep 1, 2016

Thanks! Merging to master and 2.0

asfgit pushed a commit that referenced this pull request Sep 1, 2016
…class defined in repl again

Author: Shixiong Zhu <[email protected]>

Closes #14905 from zsxwing/SPARK-17318-2.

(cherry picked from commit 21c0a4f)
Signed-off-by: Shixiong Zhu <[email protected]>
@asfgit asfgit closed this in 21c0a4f Sep 1, 2016
@zsxwing zsxwing deleted the SPARK-17318-2 branch September 1, 2016 17:27