
Conversation

@shahidki31 (Contributor) commented Dec 9, 2019

What changes were proposed in this pull request?

Currently, loading an application page from the history server is time consuming when the event log is large (e.g. ~47 minutes for an 18GB event log file). Whenever there is any change in the event log, the history server re-parses the entire event log, even for small changes. In this PR, we support incremental parsing of the event log by keeping the in-memory store in a hash map and not closing the store as long as it is valid.
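
To make the idea concrete, here is a minimal sketch of the approach (the names `IncrementalInfo`, `parseIncrementally`, and the map are illustrative only, not the PR's actual code, which wires this into FsHistoryProvider):

```scala
import java.io.File
import scala.collection.concurrent.TrieMap
import scala.io.Source

// Illustrative only: per (appId, attemptId), keep the open in-memory store plus
// the number of event log lines already parsed, instead of discarding both on
// every change to the event log.
final class IncrementalInfo(var linesRead: Int, val store: java.util.HashMap[String, AnyRef])

object IncrementalParsingSketch {
  private val infos = TrieMap.empty[(String, Option[String]), IncrementalInfo]

  // Replay only the events appended since the last pass; the store stays open.
  def parseIncrementally(appKey: (String, Option[String]), logFile: File)
                        (replay: String => Unit): Unit = {
    val info = infos.getOrElseUpdate(appKey,
      new IncrementalInfo(0, new java.util.HashMap[String, AnyRef]()))
    val source = Source.fromFile(logFile)
    try {
      source.getLines().drop(info.linesRead).foreach { line =>
        replay(line)           // feed only the new events to the listeners
        info.linesRead += 1
      }
    } finally {
      source.close()           // the file handle is closed, but the store stays cached
    }
  }
}
```

Because only lines after `linesRead` are replayed, a small append to a very large event log no longer triggers a full re-parse.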

Why are the changes needed?

To speed up loading the history server page.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a UT; also verified with manual tests and existing tests.

Created an event log of ~2GB and then appended a few more events. Loading time without the PR is ~1 minute; with the PR it is around 2 seconds.

@shahidki31 (Contributor, Author)

cc @vanzin @squito Kindly review

@SparkQA commented Dec 9, 2019

Test build #115049 has finished for PR 26821 at commit db79fdd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SQLAppStatusListenerData(
  • class HiveThriftserver2ListenerData(

@SparkQA commented Dec 9, 2019

Test build #115051 has finished for PR 26821 at commit 1889b53.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

cc @HeartSaVioR

@HeartSaVioR (Contributor) commented Dec 10, 2019

Thanks for cc'ing me, @dongjoon-hyun. I'll take a look.

Btw, I think we have another JIRA issue for supporting incremental parsing, SPARK-28870, which has a broader goal - running with any implementation of KVStore.

At first glance, this patch could cover SPARK-29261, and with SPARK-29111 it may resolve SPARK-28870 altogether - though we struggled on the details previously, so I need some time to go through it deeply.

@shahidki31
I assume you've followed the previous discussions/efforts that @Ngone51, @vanzin, @squito, and I have made (#25577 and #25943, and the relevant Google docs in the related JIRA issues/PRs). If not, it would be worth going through them, as we discussed the details there.

@SparkQA commented Dec 10, 2019

Test build #115052 has finished for PR 26821 at commit b6b2c5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31 (Contributor, Author)

Thanks @HeartSaVioR, I will go through it.
It seems I haven't used snapshotting/checkpointing for storing the underlying KVStore. Instead, in this PR, I am not closing the store whenever there are changes in the event logs or the cache is invalidated. Currently it supports the in-memory store, but we can easily extend it to the disk store as well.
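
As a rough sketch of what "not closing the store" means here (hypothetical names; the real change is in FsHistoryProvider's handling of the cached UI):

```scala
import scala.collection.mutable

import org.apache.spark.util.kvstore.KVStore

// Hypothetical sketch: when the cached UI is invalidated or detached, keep the
// underlying in-memory KVStore instead of closing it, so the next load can
// resume parsing rather than rebuilding the store from the whole event log.
object AppStoreCacheSketch {
  private val openStores = mutable.Map.empty[(String, Option[String]), KVStore]

  def onUIInvalidated(
      appKey: (String, Option[String]),
      store: KVStore,
      incrementalParsing: Boolean): Unit = synchronized {
    if (incrementalParsing) {
      openStores(appKey) = store   // keep the store; only the UI goes away
    } else {
      store.close()                // previous behaviour: rebuild everything next time
    }
  }
}
```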

@HeartSaVioR (Contributor)

Instead, in this PR, I am not closing the store whenever there are changes in the event logs or the cache is invalidated.

This patch is simpler because it doesn't take "restarting the SHS" into account. Restarting the SHS will lose the information. And for now we may not want to track the line offset in the SHS's KVStore (listing), since the line offset is only effective during a single run of the SHS.

When you consider restoring the KVStore & the state of the listeners across an SHS restart, you will have to store a snapshot of the KVStore somewhere (that's why SPARK-29111 came in), and then you have to worry about the compatibility of the snapshot (entities in the KVStore, including live entities on listeners) across Spark versions. That's why I had to change the design and introduce SPARK-29779 instead of snapshotting.

We've already gone through a bunch of discussions because it is not as simple in reality as it seems; so please go through these PRs as well as the design docs.

I guess the patch can be reviewed right now if the community prefers to have a solution which works within a single SHS run first (though this may conflict with compaction #26416, which needs some arrangement), but if the community wants a solution which covers more cases, SPARK-28870 seems to be the way to go. (It doesn't mean this patch will not be valid - this patch could cover SPARK-29261 with some modification.)

@oopDaniel

I was working on a PR for incremental parsing in the SHS as well, and there are two ways to skip the already-parsed content: filtering by lines (the approach in this PR) and skipping by bytes (see mergeApplicationListing). From what I tested locally, filtering by lines takes roughly 30 seconds for file sizes ranging from 10MB to 400MB, while skipping by bytes only takes 2 ms for a 400MB file.
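
For illustration, a sketch of the two resume strategies being compared (not code from either PR); it shows why skipping by bytes avoids re-reading the already-parsed prefix:

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.io.Source

object ResumeSketch {
  // Approach 1: filter by lines. Every byte before the resume point is still
  // read and decoded, so the cost grows with the size of the parsed prefix.
  def resumeByLines(path: String, linesAlreadyRead: Int): Iterator[String] =
    Source.fromFile(path).getLines().drop(linesAlreadyRead)   // caller closes the Source

  // Approach 2: skip by bytes. Seek straight past the already-parsed prefix and
  // decode only the newly appended portion.
  def resumeByBytes(path: String, bytesAlreadyRead: Long): BufferedReader = {
    val in = new FileInputStream(path)
    in.getChannel.position(bytesAlreadyRead)   // constant-time jump to the resume offset
    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
  }
}
```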

@HeartSaVioR (Contributor)

there are two ways to skip the already-parsed content: filtering by lines (the approach in this PR) and skipping by bytes (see mergeApplicationListing).

Yeah, if possible we should go with the latter approach. I guess that brings more changes, as ReplayListener just relies on the Scala API which provides lines (no offset information), so we'd have to get our hands dirty, but it is definitely worth doing.
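
One possible way to get that offset information, sketched here with a hypothetical wrapper (not part of this PR): count bytes as the replay stream is consumed and persist the count as the resume offset.

```scala
import java.io.{FilterInputStream, InputStream}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch: wrap the event log stream so the byte offset is known
// after replaying, instead of relying on scala.io.Source, which only yields lines.
class CountingInputStream(in: InputStream) extends FilterInputStream(in) {
  private val count = new AtomicLong(0L)

  def bytesRead: Long = count.get()

  override def read(): Int = {
    val b = super.read()
    if (b != -1) count.incrementAndGet()
    b
  }

  override def read(buf: Array[Byte], off: Int, len: Int): Int = {
    val n = super.read(buf, off, len)
    if (n > 0) count.addAndGet(n)
    n
  }

  override def skip(n: Long): Long = {
    val skipped = super.skip(n)
    count.addAndGet(skipped)
    skipped
  }
}
```

One caveat: any buffering layered on top (e.g. a BufferedReader) reads ahead of the last fully parsed event, so the recorded offset has to be taken at event boundaries rather than straight from the wrapper.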

@shahidki31 (Contributor, Author) commented Dec 11, 2019

From what I tested locally, filtering by lines takes roughly 30 seconds for file sizes ranging from 10MB to 400MB, while skipping by bytes only takes 2 ms for a 400MB file.

Hi @oopDaniel, actually I tried both approaches, and skipping by bytes seems more complicated, as we need to handle more edge cases. Also, I tested this PR with a 2GB event log file, and the time to load the UI was around 2 seconds, including filtering and replaying. Adding byte skipping is not difficult either - all we need to do is track the bytes read instead of the lines read and handle the edge cases. @HeartSaVioR I think the approach you are working on is great, as it handles restarting the SHS. But if we can review this PR for incremental parsing, extending it to snapshotting would be easier, I guess. If there is a working PR for that, I can close this.

@HeartSaVioR (Contributor) commented Dec 11, 2019

@shahidki31
No, I didn't intend to persuade you to close this. I just wanted to make sure we have a clear picture of the full implementation before dealing with each part, but it's OK with me if you'd like to proceed with the current solution, as I think I can handle extending it with snapshotting.

I can take a look at the current solution, but you still need to persuade at least one committer to push this forward.

Btw, we'd better clarify the performance test in detail. It should include at least:

  • size of event log file for initial load
  • elapsed time for initial load
  • count/size of events for addition (mostly about size)
  • elapsed time for loading additional events

(And sure, it would be nicer if you could experiment with a varied matrix, at least a couple of tests at different event log file sizes - as you said, a huge event log file is not just a couple of GBs; it's tens of GBs.)

To me, your statement in the PR description sounds like skipping (via read-and-drop) 2GB takes around 2 seconds, which is still not ideal (as we know how to do it better), though I agree that's still a huge improvement.

@shahidki31 (Contributor, Author)

Btw, we'd better clarify the performance test in detail. It should include at least:

  • size of event log file for initial load
  • elapsed time for initial load
  • count/size of events for addition (mostly about size)
  • elapsed time for loading additional events

Thanks @HeartSaVioR, I'll do more performance testing based on file size and initial load.

@HeartSaVioR (Contributor) left a comment

Test coverage seems to be really low. I'd expect at least tests that each listener can store its information via flush and reload it from initialize, or at least that these entities can be serialized/deserialized correctly, because I encountered some issues when I tried to serialize the entities in events.
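
As one example of the kind of test being asked for, a hedged sketch (assuming the Scala-aware serializer used for the disk-backed store, `KVUtils.KVStoreScalaSerializer`; `ListenerDataFixture` is an illustrative stand-in for entities such as SQLAppStatusListenerData or HiveThriftserver2ListenerData):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.status.KVUtils

// Illustrative stand-in, not the PR's actual entity shape.
case class ListenerDataFixture(appId: String, attemptId: Option[String], lastUpdated: Long)

class ListenerEntitySerializationSuite extends SparkFunSuite {
  test("listener KVStore entities survive a serialize/deserialize round trip") {
    // Round-trip through the serializer so an entity Jackson cannot handle
    // fails in the test rather than when the store is persisted.
    val serializer = new KVUtils.KVStoreScalaSerializer()
    val original = ListenerDataFixture("app-1", Some("1"), 42L)
    val bytes = serializer.serialize(original)
    val restored = serializer.deserialize(bytes, classOf[ListenerDataFixture])
    assert(restored === original)
  }
}
```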

// during invalidate UI or detached UI. Metadata of the event read will store in the
// `IncrmentalInfo`. Whenever a new event come, parsing will happen from the line it
// read last time. Currently it supports inmemory store. TODO: Support for disk store.
private val isIncrementalParsingEnabled = storePath.isEmpty &&

We don't seem to add is/are to the names of vals even when they're expected to return a Boolean. The nearest example is fastInProgressParsing.

// If incremental parsing support configuration is enabled, underlying store will not close
// during invalidate UI or detached UI. Metadata of the event read will store in the
// `IncrmentalInfo`. Whenever a new event come, parsing will happen from the line it
// read last time. Currently it supports inmemory store. TODO: Support for disk store.

Please place the TODO on a separate comment line so that the IDE can highlight it.


private val storePath = conf.get(LOCAL_STORE_DIR).map(new File(_))
private val fastInProgressParsing = conf.get(FAST_IN_PROGRESS_PARSING)
// If incremental parsing support configuration is enabled, underlying store will not close

  1. nit: incremental parsing support configuration seems odd. IMHO, just incremental parsing would work.
  2. There are two kinds of stores in FsHistoryProvider, so it may be better to clarify here; underlying app KVStore or some better wording?

private val storePath = conf.get(LOCAL_STORE_DIR).map(new File(_))
private val fastInProgressParsing = conf.get(FAST_IN_PROGRESS_PARSING)
// If incremental parsing support configuration is enabled, underlying store will not close
// during invalidate UI or detached UI. Metadata of the event read will store in the

nit: invalidating / detaching
nit2: Metadata of reading event log (maybe there should be better words...) will be stored

private val fastInProgressParsing = conf.get(FAST_IN_PROGRESS_PARSING)
// If incremental parsing support configuration is enabled, underlying store will not close
// during invalidate UI or detached UI. Metadata of the event read will store in the
// `IncrmentalInfo`. Whenever a new event come, parsing will happen from the line it

I guess incremental parsing already explains this sentence; it looks redundant. Moreover, technically, parsing doesn't start immediately when a new event comes in.

accumulatorId: Long,
metricType: String)

class SQLAppStatusListenerData(

Maybe better to place it in SQLAppStatusListener, as the other KVStore entities are there.

kvstore.onFlush {
if (!live) {
flush((entity: LiveEntity) => updateStoreWithTriggerEnabled(entity))
if (appId != null) {

place it before flush()


def initialize(appId: String, attemptId: Option[String]): Unit = {
if (!live) {
this.appId = appId

same here

provider.mergeApplicationListingCall should be (1)
}

test("support incremental parsing of the event logs") {

I know most of the tests in FsHistoryProviderSuite don't deal with rolling event logs, but this one really should, as it's not just a simple sequential read of files.

list.size should be (1)
provider.getAttempt("app1", None).logPath should endWith(EventLogFileWriter.IN_PROGRESS)
val appUi = provider.getAppUI("app1", None)
appUi should not be null

I think it should really check whether the store reflects the events correctly. It's not enough to just check whether it's loaded or not. That should cover these cases: initial read / new events appended to the same file / a new file.

@shahidki31 (Contributor, Author)

Thanks @HeartSaVioR for the review. I will update the PR as well as post the performance reports.

@SparkQA commented Mar 10, 2020

Test build #119604 has finished for PR 26821 at commit b6b2c5a.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 27, 2020

Test build #120468 has finished for PR 26821 at commit b6b2c5a.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@github-actions bot commented Jul 6, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label Jul 6, 2020
@github-actions bot closed this Jul 7, 2020