[SPARK-26989][CORE][TEST] DAGSchedulerSuite: ensure listeners are fully processed before checking recorded values #25706

HeartSaVioR · 2019-09-06T01:00:58Z

What changes were proposed in this pull request?

This patch ensures that the values recorded by the listener are read only after the listeners have fully processed all events. To enforce this, the patch adds a new class that hides these values and exposes them through methods that guarantee the above condition. Without this guard, two threads run concurrently, 1) the listener processing thread and 2) the test main thread, and a race condition can occur.

That's also why we see a very odd thing: an error message saying the condition is met, even though the test failed:

- Barrier task failures from the same stage attempt don't trigger multiple stage retries *** FAILED ***
  ArrayBuffer(0) did not equal List(0) (DAGSchedulerSuite.scala:2656)

which means the verification failed, and the condition became true just before the error message was constructed.
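
To make the odd message easier to follow, here is a minimal single-threaded sketch that replays the suspected interleaving by hand. It does not use ScalaTest or Spark; `LateMessageDemo` and its variable names are hypothetical, and the append in the middle stands in for the listener thread catching up.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical replay of the race: the comparison and the message
// formatting read the same mutable buffer at two different moments.
object LateMessageDemo {
  def run(): (Boolean, String, Boolean) = {
    val expected = List(0)
    val recorded = ArrayBuffer.empty[Int]

    // 1) The test thread compares while the listener has not recorded anything yet.
    val matchedAtCheck = recorded == expected

    // 2) The listener "thread" catches up (simulated inline here).
    recorded += 0

    // 3) Only now is the failure message built, from the updated buffer.
    val message = s"$recorded did not equal $expected"

    (matchedAtCheck, message, recorded == expected)
  }
}
```

The comparison in step 1 is false, yet the message built in step 3 reads `ArrayBuffer(0) did not equal List(0)`, two values that compare equal as Scala sequences.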

The guard is properly placed in many spots, but missed in some places. This patch enforces that it can't be missed.
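
The idea of hiding the recorded values behind accessors can be sketched roughly as follows. This is not the code in the patch: `TinyBus`, `GuardedRecorder`, and `GuardDemo` are hypothetical stand-ins, and `TinyBus.waitUntilEmpty` only imitates the role that the listener bus's `waitUntilEmpty` plays in the suite. The 50ms sleep mimics the artificial delay used to provoke the race.

```scala
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a listener bus: handlers run on a separate thread.
class TinyBus {
  private val executor = Executors.newSingleThreadExecutor()
  private val pending = new AtomicInteger(0)

  def post(handler: () => Unit): Unit = {
    pending.incrementAndGet()
    executor.execute(new Runnable {
      override def run(): Unit =
        try handler() finally pending.decrementAndGet()
    })
  }

  // Blocks until every posted handler has finished, or throws on timeout.
  def waitUntilEmpty(timeoutMs: Long): Unit = {
    val deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs)
    while (pending.get() > 0) {
      if (System.nanoTime() > deadline) {
        throw new TimeoutException(s"bus not empty after $timeoutMs ms")
      }
      Thread.sleep(1)
    }
  }

  def stop(): Unit = executor.shutdown()
}

// Hypothetical recorder: the buffer is private, so a test cannot read it
// without going through an accessor that waits for the bus first.
class GuardedRecorder(bus: TinyBus) {
  private val _failedStages = new ArrayBuffer[Int]

  def onStageFailed(stageId: Int): Unit = {
    Thread.sleep(50) // artificial delay to make the race likely without the guard
    _failedStages += stageId
  }

  def failedStages: List[Int] = {
    bus.waitUntilEmpty(10000)
    _failedStages.toList
  }
}

object GuardDemo {
  def run(): List[Int] = {
    val bus = new TinyBus
    val recorder = new GuardedRecorder(bus)
    try {
      bus.post(() => recorder.onStageFailed(0))
      recorder.failedStages // waits for the slow handler before reading
    } finally bus.stop()
  }
}
```

Because the only way to reach the buffer is through `failedStages`, a test cannot forget the wait, which is the property the description aims for.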

Why are the changes needed?

The UT fails intermittently, and this patch addresses the flakiness.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Modified UT.

I also made the flaky tests fail artificially by adding a 50ms sleep to each onXXX method.

I found 3 test methods failing. (They're marked as X. Just ignore the ones marked ! as they failed while waiting for the listener within the given timeout, and these tests don't deal with the recorded values; they use a different timeout for this listener, 1000ms rather than 10000ms, so they were affected as a side effect.)

When I applied the same sleep on top of this patch, all tests marked as X passed.

…ly processed before checking failedStages

HeartSaVioR · 2019-09-06T01:04:09Z

I'm also seeing inconsistency across this test suite in how failedStages is verified. ~~Some places access it directly, while other places access it via scheduler.failedStages.~~ Some places convert it to a Set, others use contains and length separately, and others compare it directly with a Seq. Ideally it would be better to deal with this as well, but I'm not sure whether we should do so here or in another minor PR.

EDIT: failedStages and scheduler.failedStages are different references. My bad.

HeartSaVioR · 2019-09-06T01:06:44Z

It may be even better to extract the logic for "ensuring the listener has no events to process" and "accessing failedStages" into a method and always call it. That should help us not forget to wait for the listener thread.

…d values in listener

HeartSaVioR · 2019-09-06T03:22:26Z

There are so many authors in the file, but let me cc a couple of committers who authored related code or reported this flakiness issue.

cc. @jiangxb1987 @vanzin @tgravescs @dongjoon-hyun

SparkQA · 2019-09-06T03:30:05Z

Test build #110209 has finished for PR 25706 at commit 7808a01.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-06T04:23:07Z

Test build #110212 has finished for PR 25706 at commit ea3bc10.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class EventInfoRecordingListener extends SparkListener

vanzin

Maybe I missed it, but was there any test that was actually missing a waitUntilEmpty call? All the ones I noticed that needed it had it, and you've just moved it to shared code.

That isn't bad, but it also makes me question whether you're actually fixing the bug.

As with most timing bugs, you could potentially reproduce it by adding a sleep somewhere (e.g. in the listener that should process the event), and the test should pass even with the sleep in place.

vanzin · 2019-09-06T21:53:36Z