[SPARK-6951][core] Speed up parsing of event logs during listing. #20952

vanzin · 2018-03-30T21:33:42Z

This change introduces two optimizations to help speed up generation
of listing data when parsing events logs.

The first one allows the parser to be stopped when enough data to
create the listing entry has been read. This is currently the start
event plus environment info, to capture UI ACLs. If the end event is
needed, the code will skip to the end of the log to try to find that
information, instead of parsing the whole log file.

Unfortunately this works better with uncompressed logs. Skipping bytes
on compressed logs only saves the work of parsing lines and some events,
so not a lot of gains are observed.

The second optimization deals with in-progress logs. It works in two
ways: first, it completely avoids parsing the rest of the log for
these apps when enough listing data is read. This, unlike the above,
also speeds things up for compressed logs, since only the very beginning
of the log has to be read.

On top of that, the code that decides whether to re-parse logs to get
updated listing data will ignore in-progress applications until they've
completed.

Both optimizations can be disabled but are enabled by default.

I tested this on some fake event logs to see the effect. I created
500 logs of about 60M each (so ~30G uncompressed; each log was 1.7M
when compressed with zstd). Below, C = completed, IP = in-progress,
the size means the amount of data re-parsed at the end of logs
when necessary.

            none/C   none/IP   zstd/C   zstd/IP
On / 16k      2s       2s       22s       2s
On / 1m       3s       2s       24s       2s
Off          1.1m     1.1m      26s      24s

This was with 4 threads on a single local SSD. As expected from the
previous explanations, there are considerable gains for in-progress
logs, and for uncompressed logs, but not so much when looking at the
full compressed log.

As a side note, I removed the custom code to get the scan time by
creating a file on HDFS; since file mod times are not used to detect
changed logs anymore, local time is enough for the current use of
the SHS.

This change introduces two optimizations to help speed up generation of listing data when parsing events logs. The first one allows the parser to be stopped when enough data to create the listing entry has been read. This is currently the start event plus environment info, to capture UI ACLs. If the end event is needed, the code will skip to the end of the log to try to find that information, instead of parsing the whole log file. Unfortunately this works better with uncompressed logs. Skipping bytes on compressed logs only saves the work of parsing lines and some events, so not a lot of gains are observed. The second optimization deals with in-progress logs. It works in two ways: first, it completely avoids parsing the rest of the log for these apps when enough listing data is read. This, unlike the above, also speeds things up for compressed logs, since only the very beginning of the log has to be read. On top of data, the code that decides whether to re-parse logs to get updated listing data will ignore in-progress applications until they've completed. Both optimizations can be disabled but are enabled by default. I tested this on some fake event logs to see the effect. I created 500 logs of about 60M each (so ~30G uncompressed; each log was 1.7M when compressed with zstd). Below, C = completed, IP = in-progress, the size means the amount of data re-parsed at the end of logs when necessary. ``` none/C none/IP zstd/C zstd/IP On / 16k 2s 2s 22s 2s On / 1m 3s 2s 24s 2s Off 1.1m 1.1m 26s 24s ``` This was with 4 threads on a single local SSD. As expected from the previous explanations, there are considerable gains for in-progress logs, and for uncompressed logs, but not so much when looking at the full compressed log. As a side note, I removed the custom code to get the scan time by creating a file on HDFS; since file mod times are not used to detect changed logs anymore, local time is enough for the current use of the SHS.

SparkQA · 2018-03-31T00:57:11Z

Test build #88767 has finished for PR 20952 at commit 5d3e0f4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2018-03-31T01:00:06Z

retest this please

SparkQA · 2018-03-31T04:59:56Z

Test build #88774 has finished for PR 20952 at commit 5d3e0f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito · 2018-04-09T15:05:35Z

typo in summary: "On top of data" --> "On top of that" (I think)

squito · 2018-04-09T18:27:27Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

  // used for logging msgs (logs are re-scanned based on file size, rather than modtime)
  private val lastScanTime = new java.util.concurrent.atomic.AtomicLong(-1)

  private val pendingReplayTasksCount = new java.util.concurrent.atomic.AtomicInteger(0)


The class comment is no longer correct when it talks about finding new attempts based on modification time. You should have a listing entry for every file in the dir

squito · 2018-04-09T19:01:56Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

+                // When fast in-progress parsing is on, we don't need to re-parse when the
+                // size changes, but we do need to invalidate any existing UIs.
+                invalidateUI(info.appId.get, info.attemptId)
+                false


do you need to update info.fileSize here too? Because you skip the re-parse, you don't update it in mergeApplicationListings. So i think once you hit this condition once, you'll always invalidate the UI on every iteration.

squito · 2018-04-09T21:34:38Z

lgtm assuming tests pass

SparkQA · 2018-04-10T01:38:18Z

Test build #89076 has finished for PR 20952 at commit 3a5563b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito · 2018-04-11T14:53:01Z

merged to master

squito reviewed Apr 9, 2018

View reviewed changes

Feedback.

3a5563b

asfgit closed this in 653fe02 Apr 11, 2018

vanzin deleted the SPARK-6951 branch April 13, 2018 21:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-6951][core] Speed up parsing of event logs during listing. #20952

[SPARK-6951][core] Speed up parsing of event logs during listing. #20952

Uh oh!

vanzin commented Mar 30, 2018 •

edited

Loading

Uh oh!

SparkQA commented Mar 31, 2018

Uh oh!

vanzin commented Mar 31, 2018

Uh oh!

SparkQA commented Mar 31, 2018

Uh oh!

squito commented Apr 9, 2018

Uh oh!

squito Apr 9, 2018

Uh oh!

squito Apr 9, 2018

Uh oh!

squito commented Apr 9, 2018

Uh oh!

SparkQA commented Apr 10, 2018

Uh oh!

squito commented Apr 11, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[SPARK-6951][core] Speed up parsing of event logs during listing. #20952

[SPARK-6951][core] Speed up parsing of event logs during listing. #20952

Uh oh!

Conversation

vanzin commented Mar 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SparkQA commented Mar 31, 2018

Uh oh!

vanzin commented Mar 31, 2018

Uh oh!

SparkQA commented Mar 31, 2018

Uh oh!

squito commented Apr 9, 2018

Uh oh!

squito Apr 9, 2018

Choose a reason for hiding this comment

Uh oh!

squito Apr 9, 2018

Choose a reason for hiding this comment

Uh oh!

squito commented Apr 9, 2018

Uh oh!

SparkQA commented Apr 10, 2018

Uh oh!

squito commented Apr 11, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vanzin commented Mar 30, 2018 •

edited

Loading