@ajithme
Contributor

ajithme commented Jan 16, 2020

What changes were proposed in this pull request?

Force partition evaluation in the call-site thread before posting the org.apache.spark.scheduler.JobSubmitted event to org.apache.spark.scheduler.DAGScheduler#eventProcessLoop. This mitigates the problem of job submission events blocking the DAGScheduler thread.
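The idea can be illustrated with a small toy model. This is only a sketch and not Spark code: `LazyRDD`, `submit_job`, `run_event_loop`, and `demo` are hypothetical names, and the lock is merely a stand-in for guarding the cached partition state. It shows that once the partitions are forced in the submitting thread, the event loop thread only reads the cached value.

```python
import queue
import threading


class LazyRDD:
    """Toy stand-in for an RDD whose partition list is computed lazily and cached."""

    def __init__(self, compute_partitions):
        self._compute = compute_partitions
        self._partitions = None
        self._lock = threading.Lock()  # guards the cached partition state
        self.computed_on = None        # name of the thread that did the work

    @property
    def partitions(self):
        with self._lock:
            if self._partitions is None:
                self._partitions = self._compute()
                self.computed_on = threading.current_thread().name
            return self._partitions


def submit_job(rdd, events, eager):
    """Post a job to the event queue, optionally forcing partitions first."""
    if eager:
        _ = rdd.partitions  # potentially expensive work runs in the caller's thread
    events.put(rdd)


def run_event_loop(events):
    """Single-threaded loop; None is the shutdown signal."""
    while True:
        rdd = events.get()
        if rdd is None:
            break
        len(rdd.partitions)  # cheap if the caller already forced evaluation


def demo(eager):
    """Submit one job and report which thread computed its partitions."""
    events = queue.Queue()
    rdd = LazyRDD(lambda: list(range(4)))
    loop = threading.Thread(target=run_event_loop, args=(events,), name="event-loop")
    loop.start()
    submit_job(rdd, events, eager)
    events.put(None)
    loop.join()
    return rdd.computed_on


if __name__ == "__main__":
    print(demo(eager=False))  # partitions computed on the event loop thread
    print(demo(eager=True))   # partitions computed on the submitting thread
```

In this model the expensive computation moves off the shared event loop thread, at the cost of the submitting thread doing the work before the job is queued.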

Why are the changes needed?

The DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted events have to be processed, because DAGSchedulerEventProcessLoop is single-threaded and a slow event blocks the other events in the queue, such as TaskCompletion.
Handling a JobSubmitted event can be time-consuming depending on the nature of the job (for example, calculating parent stage dependencies, shuffle dependencies, and partitions), and while it runs, every other event has to wait.
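The head-of-line blocking described above can be sketched with a simulated clock. This is a toy model with made-up costs in milliseconds, not a measurement of Spark; `process` is a hypothetical helper that handles events strictly one at a time, as a single-threaded loop would.

```python
from collections import deque


def process(events):
    """Handle (name, cost_ms) events one at a time and record when each finishes."""
    clock_ms = 0
    finished_at = {}
    pending = deque(events)
    while pending:
        name, cost_ms = pending.popleft()
        clock_ms += cost_ms  # nothing else can run while this event is handled
        finished_at[name] = clock_ms
    return finished_at


# Assumed costs: a slow JobSubmitted (5000 ms) versus a fast one (50 ms),
# each followed by a 10 ms TaskCompletion event.
slow = process([("JobSubmitted", 5000), ("TaskCompletion", 10)])
fast = process([("JobSubmitted", 50), ("TaskCompletion", 10)])

print(slow["TaskCompletion"])  # 5010
print(fast["TaskCompletion"])  # 60
```

The TaskCompletion event costs the same in both runs; only the time it spends waiting behind JobSubmitted differs.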

Similarly, in my cluster the partition calculation for some jobs is time-consuming (similar to the stack trace in SPARK-2647). This slows down the Spark DAGSchedulerEventProcessLoop, which in turn slows down user jobs even when their tasks finish within seconds, because TaskCompletion events are processed at a slower rate due to the blockage.

Refer: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Scheduler-Spark-DAGScheduler-scheduling-performance-hindered-on-JobSubmitted-Event-td23562.html

Multiple JIRA issues refer to this behavior:
https://issues.apache.org/jira/browse/SPARK-2647
https://issues.apache.org/jira/browse/SPARK-4961

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test that reproduces the issue and evaluates the fix.

@AmplabJenkins

Can one of the admins verify this patch?

@ajithme
Contributor Author

ajithme commented Jan 16, 2020

This PR revives #24438, which was closed due to inactivity. In the old PR, @squito had mentioned guarding the partition state of the RDD with a lock (see the comment at #24438 (review)); this has since been accomplished by #25951 (SPARK-28917).

Please review @squito @dongjoon-hyun @vanzin @srowen

@ajithme
Contributor Author

ajithme commented Feb 18, 2020

gentle ping @squito @dongjoon-hyun @vanzin @srowen

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
