Commit f4317be

CodingCat authored and zsxwing committed
[SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from JobScheduler.jobSets
## What changes were proposed in this pull request?

The current implementation of Spark Streaming considers a batch completed regardless of the results of its jobs (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).

Consider the following case: a micro-batch contains two jobs that read from two different Kafka topics. One of the jobs fails because of a problem in the user-defined logic, after the other has finished successfully.

1. The main thread in the Spark Streaming application executes the line mentioned above,
2. and another thread (the checkpoint writer) writes a checkpoint file immediately after this line is executed.
3. Then, because of the current error handling mechanism in Spark Streaming, the StreamingContext is closed (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).

When the user recovers from the checkpoint file, the data being processed by the failed job is never reprocessed, because the JobSet containing the failed job was removed (treated as completed) before the checkpoint was constructed.

This PR fixes it by removing a jobset from JobScheduler.jobSets only when all jobs in the jobset have finished successfully.

## How was this patch tested?

Existing tests.

Author: CodingCat <[email protected]>
Author: Nan Zhu <[email protected]>

Closes #16542 from CodingCat/SPARK-18905.

(cherry picked from commit f8db894)
Signed-off-by: Shixiong Zhu <[email protected]>
1 parent 9758905 commit f4317be

1 file changed: +8 -6 lines changed

streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala

Lines changed: 8 additions & 6 deletions
@@ -200,18 +200,20 @@ class JobScheduler(val ssc: StreamingContext) extends Logging {
     listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
     logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
     if (jobSet.hasCompleted) {
-      jobSets.remove(jobSet.time)
-      jobGenerator.onBatchCompletion(jobSet.time)
-      logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
-        jobSet.totalDelay / 1000.0, jobSet.time.toString,
-        jobSet.processingDelay / 1000.0
-      ))
       listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
     }
     job.result match {
       case Failure(e) =>
         reportError("Error running job " + job, e)
       case _ =>
+        if (jobSet.hasCompleted) {
+          jobSets.remove(jobSet.time)
+          jobGenerator.onBatchCompletion(jobSet.time)
+          logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
+            jobSet.totalDelay / 1000.0, jobSet.time.toString,
+            jobSet.processingDelay / 1000.0
+          ))
+        }
     }
   }
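
The effect of moving the removal can be illustrated with a small standalone model of the scenario from the commit message (one job succeeds, then the second fails). This is only a sketch: `Job`, `JobSet`, `SchedulerModel`, and `Demo` below are simplified, hypothetical stand-ins rather than the real Spark classes, and the `removeOnlyOnSuccess` flag is an assumption introduced here to switch between the old and new ordering.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for Spark's Job and JobSet; not the real classes.
final case class Job(id: Int, time: Long, result: Try[Unit])

final class JobSet(val time: Long, jobCount: Int) {
  private var finished = 0
  def handleJobCompletion(job: Job): Unit = finished += 1
  // "Completed" here means every job has finished, whatever its result.
  def hasCompleted: Boolean = finished == jobCount
}

// removeOnlyOnSuccess = false models the old ordering, true models the patched one.
final class SchedulerModel(removeOnlyOnSuccess: Boolean) {
  val jobSets = mutable.Map.empty[Long, JobSet]

  def submit(jobSet: JobSet): Unit = {
    jobSets(jobSet.time) = jobSet
  }

  def handleJobCompletion(job: Job): Unit = {
    val jobSet = jobSets(job.time)
    jobSet.handleJobCompletion(job)
    if (!removeOnlyOnSuccess && jobSet.hasCompleted) {
      // Old ordering: the job set is removed before the job result is inspected.
      jobSets.remove(jobSet.time)
    }
    job.result match {
      case Failure(_) =>
        // The real scheduler reports the error here; the model does nothing.
        ()
      case _ =>
        if (removeOnlyOnSuccess && jobSet.hasCompleted) {
          // New ordering: removal happens only on the non-failure branch.
          jobSets.remove(jobSet.time)
        }
    }
  }
}

object Demo {
  // Returns the batch times still tracked after one success followed by one failure.
  def pendingAfterScenario(removeOnlyOnSuccess: Boolean): Set[Long] = {
    val scheduler = new SchedulerModel(removeOnlyOnSuccess)
    scheduler.submit(new JobSet(time = 1000L, jobCount = 2))
    scheduler.handleJobCompletion(Job(0, 1000L, Success(())))
    scheduler.handleJobCompletion(
      Job(1, 1000L, Failure(new RuntimeException("user logic failed"))))
    scheduler.jobSets.keySet.toSet
  }
}
```

In this model, `Demo.pendingAfterScenario(removeOnlyOnSuccess = false)` returns an empty set, so a checkpoint taken at that point would no longer know about the batch. With `removeOnlyOnSuccess = true` the batch at time 1000 is still tracked, which is what lets a recovery from the checkpoint reprocess it.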
