[SPARK-23626][CORE] DAGScheduler blocked due to JobSubmitted event #27234
What changes were proposed in this pull request?
Force partition evaluation in the call-site thread before sending the org.apache.spark.scheduler.JobSubmitted event to org.apache.spark.scheduler.DAGScheduler#eventProcessLoop. This helps mitigate job submission events blocking the DAGScheduler thread.
Why are the changes needed?
DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted events have to be processed, because DAGSchedulerEventProcessLoop is single-threaded and a slow event blocks the other events in the queue, such as TaskCompletion.
Handling a JobSubmitted event can be time-consuming depending on the nature of the job (for example, calculating parent stage dependencies, shuffle dependencies, and partitions), and it blocks all the events queued behind it.
Similarly, in my cluster the partition calculation for some jobs is time-consuming (similar to the stack in SPARK-2647), so it slows down the Spark DAGSchedulerEventProcessLoop. As a result, user jobs slow down even if their tasks finish within seconds, because TaskCompletion events are processed at a slower rate due to the blockage.
Refer: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Scheduler-Spark-DAGScheduler-scheduling-performance-hindered-on-JobSubmitted-Event-td23562.html
Multiple JIRA issues refer to this behavior:
https://issues.apache.org/jira/browse/SPARK-2647
https://issues.apache.org/jira/browse/SPARK-4961
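To illustrate the idea behind the change, here is a minimal, dependency-free sketch of a single-threaded event loop and a lazily computed partition list. It is not Spark code: LazyDataset, EventLoop, and submitJob are hypothetical stand-ins (for an RDD, DAGSchedulerEventProcessLoop, and the job submission path respectively), and the sketch only assumes that the partition list is cached after its first evaluation. When the caller forces the evaluation before posting the event, the expensive work runs on the call-site thread and the loop thread only reads the cached result.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}

// Hypothetical stand-in for an RDD: the partition list is computed once, lazily,
// and the name of the thread that paid for the computation is recorded.
final class LazyDataset(computePartitions: () => Seq[Int]) {
  @volatile private var computedOn: Option[String] = None
  lazy val partitions: Seq[Int] = {
    computedOn = Some(Thread.currentThread().getName)
    computePartitions()
  }
  def partitionsComputedOn: Option[String] = computedOn
}

sealed trait Event
final case class JobSubmitted(dataset: LazyDataset, done: CountDownLatch) extends Event
case object Stop extends Event

// Hypothetical single-threaded loop: one slow event delays everything queued behind it.
final class EventLoop(name: String) {
  private val queue = new LinkedBlockingQueue[Event]()
  private val thread = new Thread(name) {
    override def run(): Unit = {
      var running = true
      while (running) {
        queue.take() match {
          case JobSubmitted(dataset, done) =>
            dataset.partitions // cheap if the caller already forced it
            done.countDown()
          case Stop =>
            running = false
        }
      }
    }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def post(event: Event): Unit = queue.put(event)
  def stop(): Unit = { post(Stop); thread.join() }
}

object SubmitSketch {
  // Returns the name of the thread that evaluated the partitions.
  def submitJob(loop: EventLoop, dataset: LazyDataset, eager: Boolean): String = {
    if (eager) dataset.partitions // force evaluation on the call-site thread
    val done = new CountDownLatch(1)
    loop.post(JobSubmitted(dataset, done))
    done.await(10, TimeUnit.SECONDS)
    dataset.partitionsComputedOn.getOrElse("never")
  }
}
```

With eager = false the partitions are evaluated on the loop thread, which is the blocking behavior described above. With eager = true they are evaluated on the submitting thread, so other queued events are not held up by that computation.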
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test to reproduce the issue and evaluate the fix.