[SPARK-28869][CORE] Roll over event log files #25670

HeartSaVioR · 2019-09-03T21:42:34Z

What changes were proposed in this pull request?

This patch is a part of SPARK-28594 and design doc for SPARK-28594 is linked here: https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit?usp=sharing

This patch proposes adding new feature to event logging, rolling event log files via configured file size.

Previously event logging is done with single file and related codebase (EventLoggingListener/FsHistoryProvider) is tightly coupled with it. This patch adds layer on both reader (EventLogFileReader) and writer (EventLogFileWriter) to decouple implementation details between "handling events" and "how to read/write events from/to file".

This patch adds two properties, spark.eventLog.rollLog and spark.eventLog.rollLog.maxFileSize which provides configurable behavior of rolling log. The feature is disabled by default, as we only expect huge event log for huge/long-running application. For other cases single event log file would be sufficient and still simpler.

Why are the changes needed?

This is a part of SPARK-28594 which addresses event log growing infinitely for long-running application.

This patch itself also provides some option for the situation where event log file gets huge and consume their storage. End users may give up replaying their events and want to delete the event log file, but given application is still running and writing the file, it's not safe to delete the file. End users will be able to delete some of old files after applying rolling over event log.

Does this PR introduce any user-facing change?

No, as the new feature is turned off by default.

How was this patch tested?

Added unit tests, as well as basic manual tests.

Basic manual tests - ran SHS, ran structured streaming query with roll event log enabled, verified split files are generated as well as SHS can load these files, with handling app status as incomplete/complete.

HeartSaVioR · 2019-09-03T21:47:57Z

Given we've added layer in this patch, would we still want to convert existing tests (EventLoggingListenerSuite/FsHistoryProviderSuite/etc.) to test with both cases (single/rolling)?
I'm seeing some weird exception while testing manually:

19/09/04 06:00:11 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.util.ConcurrentModificationException
	at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
	at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
	at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.map(TraversableLike.scala:237)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:514)
	at org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:520)
	at scala.Option.map(Option.scala:163)
	at org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:519)
	at org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:155)
	at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79)
	at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:96)
	at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:158)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:99)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:84)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:102)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:102)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:97)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:93)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:93)

Fixing it would be simple - clone the properties and include into SparkListenerJobStart, but just want to know it didn't occur before this patch.

HeartSaVioR · 2019-09-03T21:53:05Z

cc. @felixcheung (as shepherd of SPARK-28594) @vanzin @squito @gengliangwang @dongjoon-hyun

also cc. to @Ngone51 as might be interested on this.

SparkQA · 2019-09-03T22:10:01Z

Test build #110065 has finished for PR 25670 at commit f79f42f.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class EventLogFileReader(
class SingleFileEventLogFileReader(
class RollingEventLogFilesFileReader(
abstract class EventLogFileWriter(
class SingleEventLogFileWriter(
class RollingEventLogFilesWriter(

SparkQA · 2019-09-04T00:10:53Z

Test build #110068 has finished for PR 25670 at commit 8c613d5.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-04T00:44:39Z

java.util.ConcurrentModificationException

This is also occurring with current master branch. You can reproduce it with below query in spark-shell, with continuously pushing records to topic1 and topic2.

val bootstrapServers = "localhost:9092"
val checkpointLocation = "/tmp/SPARK-28869-testing"
val sourceTopics = "topic1"
val sourceTopics2 = "topic2"
val targetTopic = "topic3"

val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", bootstrapServers).option("subscribe", sourceTopics).option("startingOffsets", "earliest").load()

val df2 = spark.readStream.format("kafka").option("kafka.bootstrap.servers", bootstrapServers).option("subscribe", sourceTopics2).option("startingOffsets", "earliest").load()

df.union(df2).writeStream.format("kafka").option("kafka.bootstrap.servers", bootstrapServers).option("checkpointLocation", checkpointLocation).option("topic", targetTopic).start()

SparkQA · 2019-09-04T03:13:52Z

Test build #110073 has finished for PR 25670 at commit 4bb9de0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2019-09-04T04:07:13Z

java.util.ConcurrentModificationException

This is also occurring with current master branch. You can reproduce it with below query in spark-shell, with continuously pushing records to topic1 and topic2.

val bootstrapServers = "localhost:9092"
val checkpointLocation = "/tmp/SPARK-28869-testing"
val sourceTopics = "topic1"
val sourceTopics2 = "topic2"
val targetTopic = "topic3"

val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", bootstrapServers).option("subscribe", sourceTopics).option("startingOffsets", "earliest").load()

val df2 = spark.readStream.format("kafka").option("kafka.bootstrap.servers", bootstrapServers).option("subscribe", sourceTopics2).option("startingOffsets", "earliest").load()

df.union(df2).writeStream.format("kafka").option("kafka.bootstrap.servers", bootstrapServers).option("checkpointLocation", checkpointLocation).option("topic", targetTopic).start()

is there a separate jira on this?

HeartSaVioR · 2019-09-04T04:25:50Z

Yes, filed https://issues.apache.org/jira/browse/SPARK-28967 and submitted a patch #25672

SparkQA · 2019-09-04T21:23:18Z

Test build #110141 has finished for PR 25670 at commit 082cf16.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-05T02:52:41Z

Test build #110157 has finished for PR 25670 at commit 72a6253.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-05T23:04:44Z

Test build #110202 has finished for PR 25670 at commit 9b4d53d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-05T23:34:25Z

retest this, please

SparkQA · 2019-09-06T01:14:27Z

Test build #110208 has finished for PR 25670 at commit 9b4d53d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-06T04:31:07Z

First failure: known flaky test, SPARK-26989 (submitted a patch #25706)

Second failure: SparkContext is leaked in test and makes tests failing.

[info] *** 4 SUITES ABORTED ***
[info] *** 131 TESTS FAILED ***
[error] Error: Total 418, Failed 131, Errors 4, Passed 283, Ignored 1
[error] Failed tests:
[error] 	org.apache.spark.streaming.scheduler.JobGeneratorSuite
[error] 	org.apache.spark.streaming.ReceiverInputDStreamSuite
[error] 	org.apache.spark.streaming.WindowOperationsSuite
[error] 	org.apache.spark.streaming.StreamingContextSuite
[error] 	org.apache.spark.streaming.scheduler.ReceiverTrackerSuite
[error] 	org.apache.spark.streaming.CheckpointSuite
[error] 	org.apache.spark.streaming.UISeleniumSuite
[error] 	org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite
[error] 	org.apache.spark.streaming.ReceiverSuite
[error] 	org.apache.spark.streaming.BasicOperationsSuite
[error] 	org.apache.spark.streaming.InputStreamsSuite
[error] Error during tests:
[error] 	org.apache.spark.streaming.MapWithStateSuite
[error] 	org.apache.spark.streaming.DStreamScopeSuite
[error] 	org.apache.spark.streaming.rdd.MapWithStateRDDSuite
[error] 	org.apache.spark.streaming.scheduler.InputInfoTrackerSuite

One of stack trace follows:

[info] JobGeneratorSuite:
[info] - SPARK-6222: Do not clear received block data too soon *** FAILED *** (2 milliseconds)
[info]   org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
[info] org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
[info] org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:851)
[info] org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)
[info] org.apache.spark.streaming.TestSuiteBase.setupStreams(TestSuiteBase.scala:317)
[info] org.apache.spark.streaming.TestSuiteBase.setupStreams$(TestSuiteBase.scala:311)
[info] org.apache.spark.streaming.CheckpointSuite.setupStreams(CheckpointSuite.scala:209)
[info] org.apache.spark.streaming.CheckpointSuite.$anonfun$new$3(CheckpointSuite.scala:258)
[info] scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] org.scalatest.Transformer.apply(Transformer.scala:22)
[info] org.scalatest.Transformer.apply(Transformer.scala:20)
[info] org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info] org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
[info] org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info] org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info] org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info] org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
[info] org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
[info]   at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2512)
[info]   at scala.Option.foreach(Option.scala:274)
[info]   at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2509)
[info]   at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2586)
[info]   at org.apache.spark.SparkContext.<init>(SparkContext.scala:87)
[info]   at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:851)
[info]   at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)
[info]   at org.apache.spark.streaming.scheduler.JobGeneratorSuite.$anonfun$new$1(JobGeneratorSuite.scala:65)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
[info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
[info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
[info]   at org.apache.spark.streaming.scheduler.JobGeneratorSuite.org$scalatest$BeforeAndAfter$$super$runTest(JobGeneratorSuite.scala:30)
[info]   at org.scalatest.BeforeAndAfter.runTest(BeforeAndAfter.scala:203)
[info]   at org.scalatest.BeforeAndAfter.runTest$(BeforeAndAfter.scala:192)
[info]   at org.apache.spark.streaming.scheduler.JobGeneratorSuite.runTest(JobGeneratorSuite.scala:30)
[info]   at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
[info]   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
[info]   at org.scalatest.Suite.run(Suite.scala:1147)
[info]   at org.scalatest.Suite.run$(Suite.scala:1129)
[info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
[info]   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
[info]   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
[info]   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.streaming.scheduler.JobGeneratorSuite.org$scalatest$BeforeAndAfter$$super$run(JobGeneratorSuite.scala:30)
[info]   at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:258)
[info]   at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:256)
[info]   at org.apache.spark.streaming.scheduler.JobGeneratorSuite.run(JobGeneratorSuite.scala:30)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)

HeartSaVioR · 2019-09-06T04:31:13Z

retest this, please

SparkQA · 2019-09-06T06:58:05Z

Test build #110220 has finished for PR 25670 at commit 9b4d53d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-16T23:50:00Z

I'd like to put some efforts if that accelerates reviewing. Would it help reviewing if I split this to two parts - writer (EventLoggingListener) / reader (SHS)?

HeartSaVioR · 2019-09-16T23:55:09Z

Btw, I'll start working on next stuff which doesn't depend on this patch. Maybe I'll split them down into smaller parts than what I planned. Some parts may not be unused until following part will come.

HeartSaVioR · 2019-09-17T00:47:11Z

retest this, please

SparkQA · 2019-09-17T03:02:43Z

Test build #110704 has finished for PR 25670 at commit 9b4d53d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-17T03:46:45Z

UT failure: SPARK-29104 - not relevant.

HeartSaVioR · 2019-09-17T03:46:52Z

retest this, please

HeartSaVioR · 2019-09-17T03:52:30Z

FYI, #25811 is submitted to cover supporting snapshot/restore KVStore.

SparkQA · 2019-09-17T04:06:11Z

Test build #110730 has finished for PR 25670 at commit 9b4d53d.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-17T04:19:09Z

retest this, please

SparkQA · 2019-09-17T07:05:02Z

Test build #110733 has finished for PR 25670 at commit 9b4d53d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

vanzin

Just a small nit.

vanzin · 2019-10-16T17:47:31Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

+    ConfigBuilder("spark.eventLog.rolling.maxFileSize")
      .doc("The max size of event log file to be rolled over.")
      .bytesConf(ByteUnit.BYTE)
+      .checkValue(_ >= (1024 * 1024 * 10), "Max file size of event log should be configured to" +


ByteUnit.MiB.toBytes(10)

vanzin · 2019-10-16T21:03:45Z

Looks ok, will merge tomorrow if no one else comments.

SparkQA · 2019-10-16T23:19:15Z

Test build #112190 has finished for PR 25670 at commit a2f631d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-10-17T18:14:45Z

Merging to master.

HeartSaVioR · 2019-10-17T22:46:33Z

Thanks all for reviewing and merging!

gaborgsomogyi · 2019-11-06T15:17:59Z

@HeartSaVioR just gone through this PR and plan to join later developments. You can ping me if this feature goes forward. One question what I have now. Do I see correctly that the actual implementation measures the events size before compression? If yes maybe my suggestion can be considered.

Namely I can see 2 possibilities to overcame this but both cases the basic idea is the same. Dstream variable can be wrapped with CountingOutputStream which would measure file size after compression.

If we can NOT treat spark.eventLog.rolling.maxFileSize as soft threshold we can listen at this point (with custom CountingOutputStream) for writes and initiate rolling there.
If we can treat spark.eventLog.rolling.maxFileSize as soft threshold we can just use dstream.getBytesWritten() in the actual condition.

I've tested the second approach with lz4, lzf, snappy, zstd and only lz4 didn't flush the buffer immediately. Of course this doesn't mean 2nd approach is advised, just wanted to give more info...

HyukjinKwon · 2020-01-31T03:38:54Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

+
+  private[spark] val EVENT_LOG_ROLLING_MAX_FILE_SIZE =
+    ConfigBuilder("spark.eventLog.rolling.maxFileSize")
+      .doc("The max size of event log file to be rolled over.")


Sorry leaving a comment late like this but it should have been better to say this configuration is only effective when spark.eventLog.rolling.enabled is enabled.

I think there're counter examples in Spark configurations which rely on the fact once there's a configuration .enabled, others are effective only when that is enabled.

Even we only check from the SHS configuration, spark.history.fs.cleaner.*, spark.history.kerberos.*, spark.history.ui.acls.* fall into the case.

Hm, I tend to disagree with omitting such dependent configurations in their documentations. Can we add and link related configurations in the documentations?

Sorry. Looks like we'll have to agree to disagree then. No one has privilege to make someone do the work under his/her authorship which he/she disagrees with - it will end up putting wrong authorship on commit.

@HeartSaVioR, it had to be reviewed. I just happened to review and leave some comments late. Logically if that's not documented, how do users know what configuration is effective when? At least I had to read the codes to confirm.

Also, I am trying to make sure we're on the same page so I wouldn't happen to leave this comment again since you are a regular contributor. I don't think this is a good pattern to don't document the relationship between configurations. I am going to send an email to the dev list.

@HyukjinKwon
Thanks for initiating the thread in dev mailing list. I'm following up the thread and will be back once we get some sort of consensus.

… for rolling event log ### What changes were proposed in this pull request? This patch addresses the post-hoc review comment linked here - #25670 (comment) ### Why are the changes needed? We would like to explicitly document the direct relationship before we finish up structuring of configurations. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27576 from HeartSaVioR/SPARK-28869-FOLLOWUP-doc. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

… for rolling event log ### What changes were proposed in this pull request? This patch addresses the post-hoc review comment linked here - #25670 (comment) ### Why are the changes needed? We would like to explicitly document the direct relationship before we finish up structuring of configurations. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27576 from HeartSaVioR/SPARK-28869-FOLLOWUP-doc. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 446b2d2) Signed-off-by: HyukjinKwon <[email protected]>

… for rolling event log ### What changes were proposed in this pull request? This patch addresses the post-hoc review comment linked here - apache#25670 (comment) ### Why are the changes needed? We would like to explicitly document the direct relationship before we finish up structuring of configurations. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes apache#27576 from HeartSaVioR/SPARK-28869-FOLLOWUP-doc. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

This patch is a part of [SPARK-28594](https://issues.apache.org/jira/browse/SPARK-28594) and design doc for SPARK-28594 is linked here: https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit?usp=sharing This patch proposes adding new feature to event logging, rolling event log files via configured file size. Previously event logging is done with single file and related codebase (`EventLoggingListener`/`FsHistoryProvider`) is tightly coupled with it. This patch adds layer on both reader (`EventLogFileReader`) and writer (`EventLogFileWriter`) to decouple implementation details between "handling events" and "how to read/write events from/to file". This patch adds two properties, `spark.eventLog.rollLog` and `spark.eventLog.rollLog.maxFileSize` which provides configurable behavior of rolling log. The feature is disabled by default, as we only expect huge event log for huge/long-running application. For other cases single event log file would be sufficient and still simpler. This is a part of SPARK-28594 which addresses event log growing infinitely for long-running application. This patch itself also provides some option for the situation where event log file gets huge and consume their storage. End users may give up replaying their events and want to delete the event log file, but given application is still running and writing the file, it's not safe to delete the file. End users will be able to delete some of old files after applying rolling over event log. No, as the new feature is turned off by default. Added unit tests, as well as basic manual tests. Basic manual tests - ran SHS, ran structured streaming query with roll event log enabled, verified split files are generated as well as SHS can load these files, with handling app status as incomplete/complete. Closes apache#25670 from HeartSaVioR/SPARK-28869. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Co-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>

### What changes were proposed in this pull request? This PR aims to enable `spark.eventLog.rolling.enabled` by default for Apache Spark 4.0.0. ### Why are the changes needed? Since Apache Spark 3.0.0, we have been using event log rolling not only for **long-running jobs**, but also for **some failed jobs** to archive the partial event logs incrementally. - #25670 ### Does this PR introduce _any_ user-facing change? - No because `spark.eventLog.enabled` is disabled by default. - For the users with `spark.eventLog.enabled=true`, yes, `spark-events` directory will have different layouts. However, all 3.3+ `Spark History Server` can read both old and new event logs. I believe that the event log users are already using this configuration to avoid the loss of event logs for long-running jobs and some failed jobs. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43638 from dongjoon-hyun/SPARK-45771. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

### What changes were proposed in this pull request? This PR aims to enable `spark.eventLog.rolling.enabled` by default for Apache Spark 4.0.0. ### Why are the changes needed? Since Apache Spark 3.0.0, we have been using event log rolling not only for **long-running jobs**, but also for **some failed jobs** to archive the partial event logs incrementally. - apache#25670 ### Does this PR introduce _any_ user-facing change? - No because `spark.eventLog.enabled` is disabled by default. - For the users with `spark.eventLog.enabled=true`, yes, `spark-events` directory will have different layouts. However, all 3.3+ `Spark History Server` can read both old and new event logs. I believe that the event log users are already using this configuration to avoid the loss of event logs for long-running jobs and some failed jobs. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#43638 from dongjoon-hyun/SPARK-45771. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

…g.maxFileSize` ### What changes were proposed in this pull request? This PR aims to lower the minimum limit of `spark.eventLog.rolling.maxFileSize` from `10MiB` to `2MiB` at Apache Spark 4.1.0 while keeping the default (128MiB). ### Why are the changes needed? `spark.eventLog.rolling.maxFileSize` has `10MiB` as the lower bound limit since Apache Spark 3.0.0. - #25670 By reducing the lower bound to `2MiB`, we can allow Spark jobs to write small log files more frequently and faster without waiting for `10MiB`. This is helpful some slow(large micro-batch period) or low-traffic streaming jobs. The users will set a proper value for their jobs. ### Does this PR introduce _any_ user-facing change? There is no behavior change for the existing jobs. This only extends the range of configuration values for a user who wants to have lower values. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51162 from dongjoon-hyun/SPARK-52456. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

…in `History Server` ### What changes were proposed in this pull request? This PR aims to support `On-Demand Log Loading` in `History Server` by looking up the **rolling event log locations** even Spark listing didn't finish to load the event log files. ```scala val EVENT_LOG_ROLLING_ON_DEMAND_LOAD_ENABLED = ConfigBuilder("spark.history.fs.eventLog.rolling.onDemandLoadEnabled") .doc("Whether to look up rolling event log locations on demand manner before listing files.") .version("4.1.0") .booleanConf .createWithDefault(true) ``` Previously, Spark History Server will show `Application ... Not Found` page if a job is requested before scanning it even if the file exists in the correct location. So, this PR doesn't introduce any regressions because this aims to introduce a kind of fallback logic to improve error handling . <img width="686" height="359" alt="Screenshot 2025-07-22 at 14 08 21" src="https://github.com/user-attachments/assets/fccb413c-5a57-4918-86c0-28ae81d54873" /> ### Why are the changes needed? Since Apache Spark 3.0, we have been using event log rolling not only for **long-running jobs**, but also for **some failed jobs** to archive the partial event logs incrementally. - #25670 Since Apache Spark 4.0, event log rolling is enabled by default. - #43638 On top of that, this PR aims to reduce storage cost at Apache Spark 4.1. By supporting `On-Demand Loading for rolled event logs`, we can use larger values for `spark.history.fs.update.interval` instead of the default `10s`. Although Spark History logs are consumed in various ways, It has a big benefit because most of successful Spark jobs's logs are not visited by the users. ### Does this PR introduce _any_ user-facing change? No. This is a new feature. ### How was this patch tested? Pass the CIs with newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51604 from dongjoon-hyun/SPARK-52914. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

HeartSaVioR force-pushed the SPARK-28869 branch from 4bb9de0 to 082cf16 Compare September 4, 2019 18:46

dongjoon-hyun added the SPARK CORE label Sep 5, 2019

HeartSaVioR force-pushed the SPARK-28869 branch from 72a6253 to 9b4d53d Compare September 5, 2019 21:25

HeartSaVioR mentioned this pull request Sep 8, 2019

[WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer #25577

Closed

6 tasks

vanzin reviewed Oct 16, 2019

View reviewed changes

Feedback

a2f631d

vanzin closed this in 100fc58 Oct 17, 2019

HeartSaVioR deleted the SPARK-28869 branch November 4, 2019 15:05

gaborgsomogyi mentioned this pull request Nov 8, 2019

[MINOR]FsHistoryProvider import cleanup #26436

Closed

HyukjinKwon reviewed Jan 31, 2020

View reviewed changes

HeartSaVioR mentioned this pull request Feb 14, 2020

[SPARK-28869][DOCS][FOLLOWUP] Add direct relationship between configs for rolling event log #27576

Closed

dongjoon-hyun mentioned this pull request Nov 2, 2023

[SPARK-45771][CORE] Enable spark.eventLog.rolling.enabled by default #43638

Closed

dongjoon-hyun mentioned this pull request Jun 11, 2025

[SPARK-52456][CORE] Lower the minimum limit of spark.eventLog.rolling.maxFileSize #51162

Closed

dongjoon-hyun mentioned this pull request Jul 22, 2025

[SPARK-52914][CORE] Support On-Demand Log Loading for rolling logs in History Server #51604

Closed

Uh oh!

[SPARK-28869][CORE] Roll over event log files #25670

[SPARK-28869][CORE] Roll over event log files #25670

Uh oh!

Conversation

HeartSaVioR commented Sep 3, 2019

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

HeartSaVioR commented Sep 3, 2019

Uh oh!

HeartSaVioR commented Sep 3, 2019

Uh oh!

SparkQA commented Sep 3, 2019

Uh oh!

SparkQA commented Sep 4, 2019

Uh oh!

HeartSaVioR commented Sep 4, 2019

Uh oh!

SparkQA commented Sep 4, 2019

Uh oh!

felixcheung commented Sep 4, 2019

Uh oh!

HeartSaVioR commented Sep 4, 2019

Uh oh!

SparkQA commented Sep 4, 2019

Uh oh!

SparkQA commented Sep 5, 2019

Uh oh!

SparkQA commented Sep 5, 2019

Uh oh!

HeartSaVioR commented Sep 5, 2019

Uh oh!

SparkQA commented Sep 6, 2019

Uh oh!

HeartSaVioR commented Sep 6, 2019

Uh oh!

HeartSaVioR commented Sep 6, 2019

Uh oh!

SparkQA commented Sep 6, 2019

Uh oh!

HeartSaVioR commented Sep 16, 2019

Uh oh!

HeartSaVioR commented Sep 16, 2019

Uh oh!

HeartSaVioR commented Sep 17, 2019

Uh oh!

SparkQA commented Sep 17, 2019

Uh oh!

HeartSaVioR commented Sep 17, 2019

Uh oh!

HeartSaVioR commented Sep 17, 2019

Uh oh!

HeartSaVioR commented Sep 17, 2019

Uh oh!

SparkQA commented Sep 17, 2019

Uh oh!

HeartSaVioR commented Sep 17, 2019

Uh oh!

SparkQA commented Sep 17, 2019

Uh oh!

vanzin left a comment

Choose a reason for hiding this comment

Uh oh!

vanzin Oct 16, 2019

Choose a reason for hiding this comment

Uh oh!

vanzin commented Oct 16, 2019

Uh oh!

SparkQA commented Oct 16, 2019

Uh oh!

vanzin commented Oct 17, 2019

Uh oh!

HeartSaVioR commented Oct 17, 2019

Uh oh!

gaborgsomogyi commented Nov 6, 2019

Uh oh!

HyukjinKwon Jan 31, 2020