# [SPARK-52914][CORE] Support On-Demand Log Loading for rolling logs in History Server (#51604)
Changes from all commits: `600eb8a`, `ef9b833`, `db5773b`, `834ba16`, `4ee04af`
**FsHistoryProvider.scala**

```scala
@@ -321,9 +321,13 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)

  override def getLastUpdatedTime(): Long = lastScanTime.get()

  override def getAppUI(appId: String, attemptId: Option[String]): Option[LoadedAppUI] = {
    val logPath = RollingEventLogFilesWriter.EVENT_LOG_DIR_NAME_PREFIX +
      EventLogFileWriter.nameForAppAndAttempt(appId, attemptId)
    val app = try {
      load(appId)
    } catch {
      case _: NoSuchElementException if this.conf.get(EVENT_LOG_ROLLING_ON_DEMAND_LOAD_ENABLED) =>
        loadFromFallbackLocation(appId, attemptId, logPath)
```
Review thread on the fallback case:
- What if …
- No, …
- Initially, I proposed the config name …
- Okay, I see. A bit confusing setup to me.
- I agree that it sounds confusing. Basically, it's the same as for the event log compression codec. Since a Spark job can choose …
```scala
      case _: NoSuchElementException =>
        return None
    }
```
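The definition of `EVENT_LOG_ROLLING_ON_DEMAND_LOAD_ENABLED` is not part of this excerpt. A minimal sketch of what such an entry could look like in Spark's `ConfigBuilder` style, assuming the key name from the test below and the `true` default mentioned later in the review (doc text and placement are guesses):

```scala
// Hypothetical reconstruction of the new flag; not taken from this diff.
val EVENT_LOG_ROLLING_ON_DEMAND_LOAD_ENABLED =
  ConfigBuilder("spark.history.fs.eventLog.rolling.onDemandLoadEnabled")
    .doc("Whether the History Server synthesizes a placeholder listing entry and " +
      "tries to load a rolling event log directory on demand when the requested " +
      "application is not in the listing yet.")
    .booleanConf
    .createWithDefault(true) // reviewers note below that this defaults to true
```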
```scala
@@ -345,6 +349,13 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)

        createInMemoryStore(attempt)
      }
    } catch {
      case _: FileNotFoundException if this.conf.get(EVENT_LOG_ROLLING_ON_DEMAND_LOAD_ENABLED) =>
        if (app.attempts.head.info.appSparkVersion == "unknown") {
          listing.synchronized {
            listing.delete(classOf[ApplicationInfoWrapper], appId)
          }
        }
        return None
```
Review thread on lines +352 to +358:
- **@viirya:** Hmm, I don't quite understand what this does. Seems … But what does this …
- **@dongjoon-hyun:** Thank you for the review, @viirya. Yes, this adds a dummy record (based on the user request) to proceed to load the actual file. However, if the actual file doesn't exist, …
```scala
      case _: FileNotFoundException =>
        return None
    }
```
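To make the fallback concrete: with the flag enabled, a UI request for an app that is not in the listing first guesses the rolling event log directory name, registers a placeholder, and then attempts the real load; the `FileNotFoundException` branch above removes the placeholder again when nothing exists on disk. A sketch of the guessed name, assuming the current values of the two constants (neither is shown in this diff):

```scala
// Assumed values: RollingEventLogFilesWriter.EVENT_LOG_DIR_NAME_PREFIX is
// "eventlog_v2_" in current Spark, and nameForAppAndAttempt returns the
// sanitized appId (plus "_<attemptId>" when an attempt id is present).
val appId = "app-123"                 // hypothetical application id
val logPath = "eventlog_v2_" + appId  // => "eventlog_v2_app-123"
```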
```scala
@@ -364,6 +375,18 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)

    Some(loadedUI)
  }

  private def loadFromFallbackLocation(appId: String, attemptId: Option[String], logPath: String)
      : ApplicationInfoWrapper = {
    val date = new Date(0)
    val lastUpdate = new Date()
    val info = ApplicationAttemptInfo(
      attemptId, date, date, lastUpdate, 0, "spark", false, "unknown")
    addListing(new ApplicationInfoWrapper(
      ApplicationInfo(appId, appId, None, None, None, None, List.empty),
```
Review thread on the placeholder record:
- **@viirya:** So supposedly, the …
- **@viirya:** And once periodic scanning happens, it will update the record with correct information?
- **@dongjoon-hyun:** Yes, correct, @viirya ~
- **@dongjoon-hyun:** This is a kind of placeholder.
| List(new AttemptInfoWrapper(info, logPath, 0, Some(1), None, None, None, None)))) | ||
|
Review thread on the placeholder contents:
- **@thejdeep:** Shouldn't we rely on the event log for information like …
- **@dongjoon-hyun:** This is only a dummy placeholder to allow SHS to show the application logs before periodic scanning happens. The periodic scanning will keep it in sync. BTW, how many times do you think this fallback would be used in production environments, @thejdeep? I'm curious whether you are thinking about turning off the periodic scanning.
- **@thejdeep:** Oh, I see that the intention is just to have dummy placeholders until the scanning takes care of it. If users operate a large Spark cluster, my two cents are that they may tend to access their apps on demand much more frequently, and it might just lead to an incorrect listing page. For example, we noticed that a good fraction of our SHS requests are on demand, since users would like to get their reports as soon as their app finishes and before …
- **@dongjoon-hyun:** Yes, and technically, it's not exposed in the listing page. Could you build this PR and test it yourself?
- **@dongjoon-hyun:** It sounds like a limitation of single-file event logs, @thejdeep. If you have rolling event logs, SHS already has the correct partial information while your jobs are running.
- **@dongjoon-hyun:** Just questions to understand your use cases: …
- **@thejdeep:** Thanks for sharing the context, @dongjoon-hyun. We currently do not use rolling event logs since we only serve batch use cases. All applications are currently on 3.x. I can build your PR locally and test it on single-file event logs to see how it works with listing and cleanup. I can get back to you by tomorrow at the earliest, if that works.
- **@dongjoon-hyun:** Thank you so much for the info and your efforts on reviewing this. Take your time.
- **@thejdeep:** @dongjoon-hyun, wanted to get your thoughts on #51604 (comment). Thank you!
```scala
    load(appId)
```
Review thread on `load(appId)`:
- **@mridulm:** What is the behavior if the application does not exist? (a typo in the user query, for example)
- **@thejdeep:** +1, do you think it would be better to check for the existence of the file at its location before adding an entry? This is to keep parity with how …
- **@dongjoon-hyun:** No, we had better avoid that because it requires the full path including "s3://…", @thejdeep.
```scala
  }

  override def getEmptyListingHtml(): Seq[Node] = {
    <p>
      Did you specify the correct logging directory? Please verify your setting of
```
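For reference, the pre-check discussed and rejected above would look roughly like the following sketch; `logDir` stands for a hypothetical fully-qualified event log directory, which is exactly the dependency (plus an extra filesystem RPC per lookup) the change avoids:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: check whether the guessed rolling log directory exists before
// adding a placeholder. Requires the full path, e.g. "s3://bucket/spark-logs".
def fallbackLogExists(logDir: String, logPath: String, hadoopConf: Configuration): Boolean = {
  val fullPath = new Path(logDir, logPath)
  fullPath.getFileSystem(hadoopConf).exists(fullPath)
}
```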
**FsHistoryProviderSuite.scala**

```scala
@@ -1640,6 +1640,40 @@ abstract class FsHistoryProviderSuite extends SparkFunSuite with Matchers with P
    }
  }

  test("SPARK-52914: Support spark.history.fs.eventLog.rolling.onDemandLoadEnabled") {
    Seq(true, false).foreach { onDemandEnabled =>
      withTempDir { dir =>
        val conf = createTestConf(true)
        conf.set(HISTORY_LOG_DIR, dir.getAbsolutePath)
        conf.set(EVENT_LOG_ROLLING_ON_DEMAND_LOAD_ENABLED, onDemandEnabled)
        val hadoopConf = SparkHadoopUtil.newConfiguration(conf)
        val provider = new FsHistoryProvider(conf)

        val writer1 = new RollingEventLogFilesWriter("app1", None, dir.toURI, conf, hadoopConf)
        writer1.start()
        writeEventsToRollingWriter(writer1, Seq(
          SparkListenerApplicationStart("app1", Some("app1"), 0, "user", None),
          SparkListenerJobStart(1, 0, Seq.empty)), rollFile = false)
        writer1.stop()

        assert(dir.listFiles().length === 1)
        assert(provider.getListing().length === 0)
        assert(provider.getAppUI("app1", None).isDefined == onDemandEnabled)
        assert(provider.getListing().length === (if (onDemandEnabled) 1 else 0))

        // The dummy entry should be protected from cleanLogs()
        provider.cleanLogs()
        assert(dir.listFiles().length === 1)
```
Review comment:
- **@dongjoon-hyun:** This is the test coverage, @thejdeep.
```scala
        assert(dir.listFiles().length === 1)
        assert(provider.getAppUI("nonexist", None).isEmpty)
        assert(provider.getListing().length === (if (onDemandEnabled) 1 else 0))
```
Review comment:
- **@dongjoon-hyun:** This new line verifies the cleanup, @mridulm.
```scala
        provider.stop()
      }
    }
  }

  test("SPARK-36354: EventLogFileReader should skip rolling event log directories with no logs") {
    withTempDir { dir =>
      val conf = createTestConf(true)
```
Review thread:
- **@thejdeep:** I understand that we are trying to push for usage of `RollingEventLogFilesWriter` as the new default, but for users who have single event logs, if they try to get the UI for an app, will this functionality not break for them, since `EVENT_LOG_ROLLING_ON_DEMAND_LOAD_ENABLED` is true by default?
- **@dongjoon-hyun:** Could you elaborate on what we can break here, @thejdeep?
- **@dongjoon-hyun:** The dummy metadata is added and cleaned up at the `FileNotFoundException` immediately in this function, as @mridulm requested. It works for both non-existing appIds and single-file logs.
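For operators who share this concern, the behavior can simply be turned off. A hypothetical opt-out sketch using the key name exercised by the test above (the log directory is a placeholder, and the code must live in the `org.apache.spark.deploy.history` package because `FsHistoryProvider` is `private[history]`):

```scala
package org.apache.spark.deploy.history

import org.apache.spark.SparkConf

// Sketch: with the flag off, an unknown appId yields no UI and no
// placeholder listing entry is ever created.
object OnDemandOptOutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.history.fs.logDirectory", "/tmp/spark-events") // must exist
      .set("spark.history.fs.eventLog.rolling.onDemandLoadEnabled", "false")
    val provider = new FsHistoryProvider(conf)
    assert(provider.getAppUI("no-such-app", None).isEmpty)
    provider.stop()
  }
}
```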