[SPARK-6951][core] Speed up parsing of event logs during listing. #20952
Conversation
This change introduces two optimizations to help speed up generation
of listing data when parsing event logs.
The first one allows the parser to be stopped when enough data to
create the listing entry has been read. This is currently the start
event plus environment info, to capture UI ACLs. If the end event is
needed, the code will skip to the end of the log to try to find that
information, instead of parsing the whole log file.
Unfortunately this works best with uncompressed logs. Compressed
streams do not support seeking, so skipping bytes there only saves the
work of parsing lines and dispatching some events, and the gains are
modest.
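As a rough illustration of the idea (not the PR's actual code): the event names below match Spark's JSON event log format, but the scanning helpers and the 16k chunk size are assumptions for the sketch.

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader, RandomAccessFile}
import java.nio.charset.StandardCharsets

object ListingScanSketch {
  // Stop reading the head of the log once both events needed for the
  // listing entry have been seen: app start, plus environment info
  // (which carries the UI ACLs).
  def parseHead(path: String): Unit = {
    val in = new BufferedReader(new InputStreamReader(
      new FileInputStream(path), StandardCharsets.UTF_8))
    try {
      var sawStart, sawEnv = false
      var line = in.readLine()
      while (line != null && !(sawStart && sawEnv)) {
        if (line.contains("\"Event\":\"SparkListenerApplicationStart\"")) sawStart = true
        if (line.contains("\"Event\":\"SparkListenerEnvironmentUpdate\"")) sawEnv = true
        line = in.readLine()
      }
    } finally {
      in.close()
    }
  }

  // If the end event is needed, seek close to the end of the
  // (uncompressed) file and scan only the last chunk instead of
  // replaying the whole log.
  def findEndEvent(path: String, chunkSize: Long = 16 * 1024): Boolean = {
    val raf = new RandomAccessFile(path, "r")
    try {
      raf.seek(math.max(0L, raf.length() - chunkSize))
      if (raf.getFilePointer > 0) raf.readLine() // drop a likely partial line
      Iterator.continually(raf.readLine()).takeWhile(_ != null)
        .exists(_.contains("\"Event\":\"SparkListenerApplicationEnd\""))
    } finally {
      raf.close()
    }
  }
}
```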
The second optimization deals with in-progress logs. It works in two
ways: first, it completely avoids parsing the rest of the log for
these apps when enough listing data is read. This, unlike the above,
also speeds things up for compressed logs, since only the very beginning
of the log has to be read.
On top of that, the code that decides whether to re-parse logs to get
updated listing data will ignore in-progress applications until they've
completed.
Both optimizations can be disabled but are enabled by default.
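For example, disabling both switches might look like the following; the exact config key names are assumptions for illustration, not taken from the diff.

```scala
import org.apache.spark.SparkConf

object HistoryServerConfSketch {
  // Assumed config keys; both optimizations are on by default.
  val conf = new SparkConf()
    // Disable the skip-to-end search for the end event (assumed to be
    // controlled by a reparse-chunk-size setting, where 0 turns it off).
    .set("spark.history.fs.endEventReparseChunkSize", "0")
    // Disable the fast path for in-progress logs.
    .set("spark.history.fs.inProgressOptimization.enabled", "false")
}
```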
I tested this on some fake event logs to see the effect. I created
500 logs of about 60M each (so ~30G uncompressed; each log was 1.7M
when compressed with zstd). Below, C = completed, IP = in-progress,
and the size next to "On" is the amount of data re-parsed from the
end of each log when the end event is needed.
```
          none/C   none/IP   zstd/C   zstd/IP
On / 16k  2s       2s        22s      2s
On / 1m   3s       2s        24s      2s
Off       1.1m     1.1m      26s      24s
```
This was with 4 threads on a single local SSD. As expected from the
previous explanations, the gains are considerable for in-progress
logs and for uncompressed logs, but modest when the full compressed
log still has to be read.
As a side note, I removed the custom code to get the scan time by
creating a file on HDFS; since file mod times are not used to detect
changed logs anymore, local time is enough for the current use of
the SHS.
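A sketch of the replacement, with hypothetical names (`ScanClockSketch`, `recordScanTime`); only the `lastScanTime` field appears in the diff below.

```scala
import java.util.concurrent.atomic.AtomicLong

class ScanClockSketch {
  // Change detection is based on file size rather than mod time, so
  // the local clock suffices; no file needs to be created on HDFS
  // just to read back its timestamp.
  private val lastScanTime = new AtomicLong(-1)

  def recordScanTime(): Long = {
    val now = System.currentTimeMillis()
    lastScanTime.set(now)
    now
  }
}
```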
Test build #88767 has finished for PR 20952 at commit

retest this please

Test build #88774 has finished for PR 20952 at commit
typo in summary: "On top of data" --> "On top of that" (I think)
```
// used for logging msgs (logs are re-scanned based on file size, rather than modtime)
private val lastScanTime = new java.util.concurrent.atomic.AtomicLong(-1)

private val pendingReplayTasksCount = new java.util.concurrent.atomic.AtomicInteger(0)
```
The class comment is no longer correct when it talks about finding new attempts based on modification time. You should have a listing entry for every file in the dir.
```
// When fast in-progress parsing is on, we don't need to re-parse when the
// size changes, but we do need to invalidate any existing UIs.
invalidateUI(info.appId.get, info.attemptId)
false
```
do you need to update info.fileSize here too? Because you skip the re-parse, you don't update it in mergeApplicationListings. So I think once you hit this condition once, you'll always invalidate the UI on every iteration.
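A self-contained model of what the reviewer seems to be suggesting; `AppInfo`, `ListingSketch`, and `shouldReparse` are hypothetical stand-ins for the real `FsHistoryProvider` types.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the listing entry and its store.
case class AppInfo(appId: String, attemptId: Option[String], fileSize: Long)

class ListingSketch(fastInProgressParsing: Boolean) {
  private val store = mutable.Map[String, AppInfo]()

  private def invalidateUI(appId: String, attemptId: Option[String]): Unit =
    println(s"invalidate UI for $appId/$attemptId")

  // Returns true when a full re-parse of the log is needed.
  def shouldReparse(info: AppInfo, currentLen: Long): Boolean = {
    if (info.fileSize < currentLen) {
      if (fastInProgressParsing) {
        invalidateUI(info.appId, info.attemptId)
        // The suggested fix: persist the new size so the next scan
        // does not invalidate the UI again for the same growth.
        store(info.appId) = info.copy(fileSize = currentLen)
        false
      } else {
        true
      }
    } else {
      false
    }
  }
}
```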
lgtm assuming tests pass

Test build #89076 has finished for PR 20952 at commit

merged to master