[SPARK-27394][WebUI] Flush LiveEntity if necessary when receiving SparkListenerExecutorMetricsUpdate #24303
Changes from all commits: ee53708, 1f927a2, 5a04be9, 289e996, 2645f35, 39ea357, 1c53071
```diff
@@ -100,14 +100,18 @@ class UISeleniumSuite extends SparkFunSuite with WebBrowser with Matchers with B
    * Create a test SparkContext with the SparkUI enabled.
    * It is safe to `get` the SparkUI directly from the SparkContext returned here.
    */
-  private def newSparkContext(killEnabled: Boolean = true): SparkContext = {
+  private def newSparkContext(
+      killEnabled: Boolean = true,
+      master: String = "local",
+      additionalConfs: Map[String, String] = Map.empty): SparkContext = {
     val conf = new SparkConf()
-      .setMaster("local")
+      .setMaster(master)
       .setAppName("test")
       .set(UI_ENABLED, true)
       .set(UI_PORT, 0)
       .set(UI_KILL_ENABLED, killEnabled)
       .set(MEMORY_OFFHEAP_SIZE.key, "64m")
+    additionalConfs.foreach { case (k, v) => conf.set(k, v) }
     val sc = new SparkContext(conf)
     assert(sc.ui.isDefined)
     sc
```
```diff
@@ -725,6 +729,31 @@ class UISeleniumSuite extends SparkFunSuite with WebBrowser with Matchers with B
     }
   }
 
+  test("Staleness of Spark UI should not last minutes or hours") {
+    withSpark(newSparkContext(
+      master = "local[2]",
+      // Set a small heart beat interval to make the test fast
+      additionalConfs = Map(
+        EXECUTOR_HEARTBEAT_INTERVAL.key -> "10ms",
+        LIVE_ENTITY_UPDATE_MIN_FLUSH_PERIOD.key -> "10ms"))) { sc =>
+      sc.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "true")
+      val f = sc.parallelize(1 to 1000, 1000).foreachAsync { _ =>
+        // Make the task never finish so there won't be any task start/end events after the first 2
+        // tasks start.
+        Thread.sleep(300000)
```
Contributor: Nit: what about a sleep of less than 5 minutes here, something comparable with the `eventually` timeout?

Author (Member): I turned on

Contributor: I have checked and the thread leaks are gone.
```diff
+      }
+      try {
+        eventually(timeout(10.seconds)) {
+          val jobsJson = getJson(sc.ui.get, "jobs")
+          jobsJson.children.length should be (1)
+          (jobsJson.children.head \ "numActiveTasks").extract[Int] should be (2)
+        }
+      } finally {
+        f.cancel()
+      }
+    }
+  }
+
   def goToUi(sc: SparkContext, path: String): Unit = {
     goToUi(sc.ui.get, path)
   }
```
Reviewer: Hmm... in the bug you mention that job-level data is not being updated. Is that the only case? If that's it, then this looks like overkill. You could, for example, update the jobs in the code that handles `event.accumUpdates` above, or even just flush jobs specifically, instead of everything. Doing a full flush here seems like overkill and a little expensive when you think about many heartbeats arriving in a short period (even when considering `lastFlushTimeNs`).
Author: I also noticed that the executor active task counts can sometimes be wrong. That's why I decided to flush everything, to make sure we don't miss any places; it's also hard to maintain if we need to flush manually in every place. Ideally, we would flush periodically so that it doesn't depend on receiving a Spark event, but that would mean adding a new event type and posting it to the listener bus, which is overkill. There will be at least 100 ms between flushes, so as long as we process heartbeats quickly, most of them won't trigger a flush.
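For readers following the thread, here is a minimal sketch of the throttling pattern described above. The `ThrottledFlusher` class and `maybeFlushAll` method are illustrative names, not the actual `AppStatusListener` code; only the idea of guarding a full flush with a `lastFlushTimeNs` check comes from this discussion.

```scala
// Sketch: flush all dirty live entities when an executor heartbeat arrives,
// but at most once per configured minimum flush period.
class ThrottledFlusher(minFlushPeriodNs: Long) {
  private var lastFlushTimeNs: Long = System.nanoTime()

  // Would be invoked from the handler for SparkListenerExecutorMetricsUpdate.
  def maybeFlushAll(flushAll: () => Unit): Unit = {
    val now = System.nanoTime()
    if (now - lastFlushTimeNs > minFlushPeriodNs) {
      flushAll()            // write every dirty LiveEntity to the status store
      lastFlushTimeNs = now // heartbeats arriving before the period elapses are no-ops
    }
  }
}
```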
Reviewer: If the goal is to use the heartbeats as a trigger for flushing, how about using some ratio of the heartbeat period, instead of `liveUpdatePeriodNs`, to control whether to flush everything? Really large apps can get a little backed up when processing heartbeats from lots and lots of busy executors, and this would make that a little worse.
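One way to read this suggestion, as a rough sketch: tie the flush threshold to the executor heartbeat interval rather than to `liveUpdatePeriodNs`. The helper, its parameters, and the ratio of 10 below are hypothetical, chosen only to illustrate the idea; they are not part of the PR.

```scala
import java.util.concurrent.TimeUnit

object HeartbeatRatioFlush {
  // Hypothetical: flush only if more than `ratio` heartbeat intervals have passed
  // since the last flush, instead of comparing against liveUpdatePeriodNs.
  def shouldFlush(
      nowNs: Long,
      lastFlushNs: Long,
      heartbeatIntervalMs: Long, // e.g. the value of spark.executor.heartbeatInterval
      ratio: Long = 10L): Boolean = {
    nowNs - lastFlushNs > TimeUnit.MILLISECONDS.toNanos(heartbeatIntervalMs) * ratio
  }
}
```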
Author: The update only happens in the live UI, which should be fine in general. For really large apps, would it help to set `LIVE_ENTITY_UPDATE_PERIOD` to a larger value? Setting a ratio of the heartbeat period seems a bit complex.
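For illustration only, tuning those settings from user code might look like the sketch below. It assumes the usual key names behind `LIVE_ENTITY_UPDATE_PERIOD` and the new `LIVE_ENTITY_UPDATE_MIN_FLUSH_PERIOD` (`spark.ui.liveUpdate.period` and `spark.ui.liveUpdate.minFlushPeriod`); the values are arbitrary examples, not recommendations.

```scala
import org.apache.spark.SparkConf

// Example only: relax the live-UI update/flush frequency for a very large app.
val conf = new SparkConf()
  .set("spark.ui.liveUpdate.period", "1s")         // assumed key for LIVE_ENTITY_UPDATE_PERIOD
  .set("spark.ui.liveUpdate.minFlushPeriod", "5s") // assumed key for LIVE_ENTITY_UPDATE_MIN_FLUSH_PERIOD
```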
Reviewer: The "don't write to the store all the time" behavior was added specifically to speed up live UIs, because copying and writing the data (even to the in-memory store) becomes really expensive when you have event storms (think thousands of tasks starting and stopping in a very short period). We should avoid requiring configuration tweaks for things not to break, when possible.