[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time #3794
Conversation
Use createQueryTest
[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
Test build #24805 has finished for PR 3794 at commit
This looks like a legitimate test failure.
[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
Test build #24810 has finished for PR 3794 at commit
To reformat the PR description to make it a little easier to read:
To maybe summarize the motivation a bit more succinctly, it seems like the problem here is that the first call to rdd.partitions (e.g. HadoopRDD.getPartitions) can be very slow and currently happens inside the DAGScheduler event loop, where it blocks the processing of other events. It seems like the fix in this patch is to force partitions to be eagerly computed in the driver thread that defines the RDD.
(This comment is kind of moot since I proposed a more general fix in a top-level comment, but I'll still post it anyways:)
I don't think that logging an exception at debug level then returning null is a good error-handling strategy; this is likely to cause a confusing NPE somewhere else with no obvious cause since most users won't have debug-level logging enabled.
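For illustration, a minimal sketch of the pattern being criticized (the object, logger, and method names are assumptions, not the actual patch):

```scala
import org.slf4j.LoggerFactory

// Hypothetical sketch: the failure is only visible at debug level, and the null
// return surfaces much later as an unrelated-looking NullPointerException.
object PartitionLoader {
  private val log = LoggerFactory.getLogger(getClass)

  def loadPartitions(): Array[Int] = {
    try {
      computePartitions()
    } catch {
      case e: Exception =>
        log.debug("Failed to compute partitions", e) // invisible with default log levels
        null                                         // callers will NPE far from the real cause
    }
  }

  private def computePartitions(): Array[Int] =
    throw new RuntimeException("simulated failure")
}
```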
It seems like the fix in this patch is to force partitions to be eagerly-computed in the driver thread that defines the RDD. This seems like a good idea
How would this interact with the idea of @erikerlandson to defer partition computation?
#3079
[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
Test build #24892 timed out for PR 3794 at commit
Won't this now throw an NPE if we call partitions from a worker, since now this will return null after the RDD is serialized and deserialized? I guess maybe we never do that?
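To illustrate the concern, a minimal sketch (a hypothetical class, not Spark's actual RDD code) of how an eagerly computed @transient field turns into null on the worker side:

```scala
// Hypothetical example: the field is computed eagerly on the driver, but it is
// @transient, so after the object is serialized to a worker and deserialized,
// reading it yields null instead of recomputing.
class EagerPartitions(compute: () => Array[Int]) extends Serializable {
  @transient private val cached: Array[Int] = compute() // driver-side only

  // If this just returns the cached value, a worker-side call sees null.
  def partitions: Array[Int] = cached
}
```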
[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
Test build #25745 has finished for PR 3794 at commit
@JoshRosen Thanks for your comments. I've updated it. I now use getParentStages directly, which will call the RDDs' getPartitions before sending the JobSubmitted event. Is that ok?
Good catch on the error-handling logic.
Does this really call
[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
Test build #25784 has finished for PR 3794 at commit
@JoshRosen Thanks. I've updated it per your comments. Please review again. However, there are merge conflicts; I will resolve them if this approach is accepted.
I'd expand this comment to explain that the reason for performing this call here is that computing the partitions may be very expensive for certain types of RDDs (e.g. HadoopRDDs), so therefore we'd like that computation to take place outside of the DAGScheduler to avoid blocking its event processing loop. I'd also mention SPARK-4961 so that it's easier to find more context on JIRA.
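Something along these lines, for example (the wording and the call site are assumptions based on this discussion, not the exact patch):

```scala
// SPARK-4961: computing the partitions of some RDDs (e.g. HadoopRDD) can be very
// expensive, so trigger that computation here, in the caller's thread, rather than
// inside the DAGScheduler event loop where it would block other JobSubmitted events.
rdd.partitions
```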
This approach looks good to me, so feel free to bring this up to date with master.
Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
Test build #26041 has finished for PR 3794 at commit
Test build #26042 has finished for PR 3794 at commit
@JoshRosen I've brought this up to date with master. Thanks.
I just realized that this could be a thread-safety issue: getParentStages could call getShuffleMapStage, which mutates a non-thread-safe shuffleToMapStage map. Even if that map was synchronized, we could still have race-conditions between calls from the event processing loop and external calls.
Do you think we could just call rdd.partitions on the final RDD (e.g. the rdd local variable here) instead of calling getParentStages?
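A hedged sketch of that simpler alternative: in the method that posts JobSubmitted (e.g. DAGScheduler.submitJob), touch the final RDD's partitions in the caller's thread before handing off to the event loop. Names are paraphrased from the discussion, not the exact patch.

```scala
import org.apache.spark.rdd.RDD

object SubmitJobSketch {
  // Force the expensive first partitions call outside the event loop,
  // then post JobSubmitted as before.
  def warmFinalRdd[T](rdd: RDD[T]): Unit = {
    rdd.partitions // expensive for e.g. HadoopRDD; runs in the submitting thread
    // ... post JobSubmitted to the event loop here
  }
}
```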
@JoshRosen I don't think just calling rdd.partitions on the final RDD would achieve our goal. Furthermore, rdd.partitions has already been called before:
/cc @marmbrus, since you mentioned seeing this issue before. Do you think the proposal of having our own DAG traversal outside of DAGScheduler + calling
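For context, a minimal sketch of what such a DAG traversal could look like (an assumed helper, not part of this patch): walk the RDD lineage outside of the DAGScheduler and force partition computation for every RDD before the JobSubmitted event is posted.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD

object EagerPartitionWalker {
  def eagerlyComputePartitions(finalRdd: RDD[_]): Unit = {
    val visited = mutable.HashSet[RDD[_]]()
    val stack = mutable.Stack[RDD[_]](finalRdd)
    while (stack.nonEmpty) {
      val rdd = stack.pop()
      if (!visited(rdd)) {
        visited += rdd
        rdd.partitions // expensive for e.g. HadoopRDD; runs in the caller's thread
        rdd.dependencies.foreach(dep => stack.push(dep.rdd))
      }
    }
  }
}
```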
Test build #30147 has finished for PR 3794 at commit
FYI, this is on my list of old PRs / issues to revisit in the medium-term. I'm also considering adding some instrumentation to DAGScheduler to make this type of blocking / slowdown easier to discover; see https://issues.apache.org/jira/browse/SPARK-8344
In #7002, I added message processing time metrics to DAGScheduler using Codahale metrics, so it should now be much easier to benchmark this.
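For illustration, a sketch of per-message timing with Codahale (Dropwizard) metrics; the class and metric names here are illustrative, not the exact instrumentation added in #7002.

```scala
import com.codahale.metrics.{MetricRegistry, Timer}

// Wraps each event-handling call in a Codahale timer so processing time
// shows up as a histogram in the metrics registry.
class TimedEventProcessing(registry: MetricRegistry) {
  private val messageProcessingTimer: Timer =
    registry.timer("dagScheduler.messageProcessingTime")

  def process(handle: () => Unit): Unit = {
    val ctx = messageProcessingTimer.time()
    try {
      handle()
    } finally {
      ctx.stop() // records the elapsed time for this message
    }
  }
}
```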
@markhamstra @kayousterhout could you have a look?
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
HadoopRDD.getPartitions is evaluated lazily, during DAGScheduler.JobSubmitted processing. If the input directory is large, getPartitions can take a long time; in our cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, all other JobSubmitted events must wait. We therefore want to move HadoopRDD.getPartitions forward so that it no longer contributes to DAGScheduler.JobSubmitted processing time, and other JobSubmitted events don't have to wait as long. A HadoopRDD object could compute its partitions when it is instantiated.
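As a rough illustration of the idea from the driver's perspective (the input path and app name below are hypothetical), forcing the partition computation before any job is submitted makes it run in the driver thread instead of inside the DAGScheduler event loop:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WarmPartitionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("warm-partitions"))
    val rdd = sc.textFile("hdfs:///data/large-input") // hypothetical input directory
    rdd.partitions                                    // triggers HadoopRDD.getPartitions now
    println(rdd.count())                              // the job no longer pays getPartitions inside the scheduler
    sc.stop()
  }
}
```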
We could analyse and compare the execution time before and after optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_____].
(1) The app has only one job
(a)
The execution time of the job before optimization is [time1__][time2_][time3___][time4_____].
The execution time of the job after optimization is....[time1__][time3___][time2_][time4_____].
In summary, if the app has only one job, the total execution time is the same before and after the optimization.
(2) The app has 4 jobs
(a) Before optimization,
job1 execution time is [time2_][time3___][time4_____],
job2 execution time is [time2__________][time3___][time4_____],
job3 execution time is................................[time2____][time3___][time4_____],
job4 execution time is................................[time2_____________][time3___][time4_____].
After optimization,
job1 execution time is [time3___][time2_][time4_____],
job2 execution time is [time3___][time2__][time4_____],
job3 execution time is................................[time3___][time2_][time4_____],
job4 execution time is................................[time3___][time2__][time4_____].
In summary, if the app has multiple jobs, the average execution time after the optimization is less than before.
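As a concrete illustration with hypothetical numbers (and assuming the cluster can run the jobs' stages concurrently): suppose time2 = 1s, time3 = 100s, time4 = 30s, and two jobs are submitted at the same time from different threads. Before the optimization, the event loop processes job1's submission (1s + 100s) before it can start on job2's, so job1 finishes at about 1 + 100 + 30 = 131s and job2 at about 101 + 101 + 30 = 232s, an average of roughly 181s. After the optimization, both threads compute their partitions concurrently (100s each) and the event loop only spends about 1s per submission, so job1 finishes around 100 + 1 + 30 = 131s and job2 around 100 + 2 + 30 = 132s, an average of roughly 131s. The single-job time is unchanged, but the average over multiple concurrent jobs drops, matching the timelines above.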