
Conversation

@YanTangZhai
Contributor

HadoopRDD.getPartitions is lazily evaluated, so it ends up running while DAGScheduler processes the JobSubmitted event. If the input directory is large, getPartitions may take a long time; for example, in our cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, all other events have to wait. We therefore want to move HadoopRDD.getPartitions earlier to reduce the JobSubmitted processing time, so that other JobSubmitted events don't have to wait as long. A HadoopRDD could compute its partitions when it is instantiated.
We can analyse and compare the execution time before and after the optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_____].
(1) The app has only one job
(a)
The execution time of the job before optimization is [time1__][time2_][time3___][time4_____].
The execution time of the job after optimization is....[time1__][time3___][time2_][time4_____].
In summary, if the app has only one job, the total execution time is the same before and after the optimization.
(2) The app has 4 jobs
(a) Before optimization,
job1 execution time is [time2_][time3___][time4_____],
job2 execution time is [time2__________][time3___][time4_____],
job3 execution time is................................[time2____][time3___][time4_____],
job4 execution time is................................[time2_____________][time3___][time4_____].
After optimization,
job1 execution time is [time3___][time2_][time4_____],
job2 execution time is [time3___][time2__][time4_____],
job3 execution time is................................[time3___][time2_][time4_____],
job4 execution time is................................[time3___][time2__][time4_____].
In summary, if the app has multiple jobs, the average execution time after the optimization is lower than before.
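
To make the idea concrete, here is a minimal sketch of the effect we are after, written from the user side rather than inside Spark (the class name and input path are illustrative): forcing rdd.partitions on the driver before any action runs means the expensive getPartitions call happens outside the DAGScheduler event loop.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EagerPartitionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("eager-partitions"))
    val rdd = sc.textFile("hdfs:///some/large/input/dir")  // path is illustrative
    rdd.partitions          // triggers getPartitions eagerly; the result is cached
    println(rdd.count())    // JobSubmitted no longer pays the getPartitions cost
    sc.stop()
  }
}
```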

@SparkQA

SparkQA commented Dec 25, 2014

Test build #24805 has finished for PR 3794 at commit 5601a8b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

This looks like a legitimate test failure.

@SparkQA

SparkQA commented Dec 25, 2014

Test build #24810 has finished for PR 3794 at commit af5abda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

To reformat the PR description to make it a little easier to read:

HadoopRDD.getPartitions is lazily evaluated, so it ends up running while DAGScheduler processes the JobSubmitted event. If the input directory is large, getPartitions may take a long time; for example, in our cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, all other events have to wait. We therefore want to move HadoopRDD.getPartitions earlier to reduce the JobSubmitted processing time, so that other JobSubmitted events don't have to wait as long. A HadoopRDD could compute its partitions when it is instantiated.

We can analyse and compare the execution time before and after the optimization.

TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_____].

(1) The app has only one job
(a)

The execution time of the job before optimization is [time1__][time2_][time3___][time4_____].
The execution time of the job after optimization is....[time1__][time3___][time2_][time4_____].

In summary, if the app has only one job, the total execution time is the same before and after the optimization.
(2) The app has 4 jobs
(a) Before optimization,

job1 execution time is [time2_][time3___][time4_____],
job2 execution time is [time2__________][time3___][time4_____],
job3 execution time is................................[time2____][time3___][time4_____],
job4 execution time is................................[time2_____________][time3___][time4_____].

After optimization,

job1 execution time is [time3___][time2_][time4_____],
job2 execution time is [time3___][time2__][time4_____],
job3 execution time is................................[time3___][time2_][time4_____],
job4 execution time is................................[time3___][time2__][time4_____].

In summary, if the app has multiple jobs, the average execution time after the optimization is lower than before.

@JoshRosen
Contributor

To maybe summarize the motivation a bit more succinctly, it seems like the problem here is that the first call to rdd.partitions might be expensive and might occur inside the DAGScheduler event loop, blocking the entire scheduler. I guess this is an unfortunate side-effect of laziness: we might have expensive lazy initialization but it can be hard to reason about when/where it will occur, causing difficult-to-diagnose performance bottlenecks.

It seems like the fix in this patch is to force partitions to be eagerly-computed in the driver thread that defines the RDD. This seems like a good idea, but I have a few minor nits with the fix as it's currently implemented:

  • I understand that the motivation for this is HadoopRDD's expensive getPartitions method, but it seems like the problem is potentially more general. Is there any way to handle this in RDD instead? I understand that we can't just make partitions into a val, but it looks like the @transient partitions_ logic is already there in RDD, so maybe we could just toss a self.partitions() call into the RDD constructor to force eager evaluation on the driver?
  • If there's some reason that we can't implement my proposal in RDD, then I think we can just add a call to self.partitions() at the end of HadoopRDD; this would eliminate the need for a bunch of the confusing variable names added here. (A rough sketch of this idea follows below.)
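
A rough sketch of the second idea, using a simplified stand-in class rather than the real HadoopRDD (the class name, fields, and method bodies are all illustrative):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Simplified stand-in for HadoopRDD, only to show where the eager call would go.
class EagerHadoopLikeRDD(sc: SparkContext, inputDir: String)
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    // The real HadoopRDD calls InputFormat.getSplits here, which is what can
    // take hundreds of seconds for a large input directory.
    Array.empty[Partition]
  }

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty

  // Last statement of the constructor body: forces getPartitions to run (and be
  // cached by RDD.partitions) on the driver thread that creates the RDD, instead
  // of inside the DAGScheduler event loop.
  partitions
}
```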

Contributor

(This comment is kind of moot since I proposed a more general fix in a top-level comment, but I'll still post it anyways:)

I don't think that logging an exception at debug level then returning null is a good error-handling strategy; this is likely to cause a confusing NPE somewhere else with no obvious cause since most users won't have debug-level logging enabled.

Contributor

It seems like the fix in this patch is to force partitions to be eagerly-computed in the driver thread that defines the RDD. This seems like a good idea

How would this interact with the idea of @erikerlandson to defer partition computation?
#3079

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24892 timed out for PR 3794 at commit 6e95955 after a configured wait of 120m.

Contributor

Won't this now throw an NPE if we call partitions from a worker, since this will now return null after the RDD is serialized and deserialized? I guess maybe we never do that?

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25745 has finished for PR 3794 at commit b535a53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@YanTangZhai
Contributor Author

@JoshRosen Thanks for your comments. I've updated it. I directly use getParentStages, which will call RDD's getPartitions, before sending the JobSubmitted event. Is that OK?

@JoshRosen
Contributor

Good catch on the error-handling logic.

I directly use getParentStages, which will call RDD's getPartitions, before sending the JobSubmitted event.

Does this really call .partitions? It looks like getParentStages just looks at RDDs' dependencies. I was suggesting something more like using getParentStages to get the list of RDDs, then explicitly calling .foreach(_.partitions) on that list.
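
Something along these lines (a hedged sketch with illustrative names, not actual DAGScheduler code): walk the dependency graph of the final RDD and force .partitions on every ancestor, so the expensive calls happen before the JobSubmitted event is posted.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD

object EagerPartitions {
  def eagerlyComputePartitions(finalRdd: RDD[_]): Unit = {
    val visited = mutable.HashSet[RDD[_]]()
    var toVisit: List[RDD[_]] = List(finalRdd)
    while (toVisit.nonEmpty) {
      val rdd = toVisit.head
      toVisit = toVisit.tail
      if (!visited(rdd)) {
        visited += rdd
        rdd.partitions  // may be very expensive, e.g. for a HadoopRDD
        toVisit = rdd.dependencies.map(_.rdd).toList ::: toVisit
      }
    }
  }
}
```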

@SparkQA

SparkQA commented Jan 20, 2015

Test build #25784 has finished for PR 3794 at commit aed530b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@YanTangZhai
Contributor Author

@JoshRosen Thanks. I've updated it per your comments. Please review again. However, there are merge conflicts; I will resolve them if this approach is accepted.

Contributor

I'd expand this comment to explain that the reason for performing this call here is that computing the partitions may be very expensive for certain types of RDDs (e.g. HadoopRDDs), so therefore we'd like that computation to take place outside of the DAGScheduler to avoid blocking its event processing loop. I'd also mention SPARK-4961 so that it's easier to find more context on JIRA.

@JoshRosen
Contributor

This approach looks good to me, so feel free to bring this up to date with master.

@SparkQA

SparkQA commented Jan 24, 2015

Test build #26041 has finished for PR 3794 at commit 267e375.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2015

Test build #26042 has finished for PR 3794 at commit d5c0e84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@YanTangZhai
Contributor Author

@JoshRosen I've brought this up to date with master. Thanks.

Contributor

I just realized that this could be a thread-safety issue: getParentStages could call getShuffleMapStage, which mutates a non-thread-safe shuffleToMapStage map. Even if that map were synchronized, we could still have race conditions between calls from the event processing loop and external calls.

Do you think we could just call rdd.partitions on the final RDD (e.g. the rdd local variable here) instead of calling getParentStages?

@YanTangZhai
Contributor Author

@JoshRosen I don't think just calling rdd.partitions on the final RDD would achieve our goal. Furthermore, rdd.partitions is already called there:

// Check to make sure we are not launching a task on a partition that does not exist.
val maxPartitions = rdd.partitions.length

However, that does not cover some scenarios, like the example I described above.
To avoid the thread-safety issue, do you think we could use another method to get the parent stages that does not mutate any global map, or use a separate method like the getParentPartitions I committed before to get the partitions directly?

@JoshRosen
Contributor

/cc @marmbrus, since you mentioned seeing this issue before. Do you think the proposal of having our own DAG traversal outside of DAGScheduler + calling partitions there will fix the case that you encountered?

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30147 has finished for PR 3794 at commit d5c0e84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@JoshRosen
Contributor

FYI, this is on my list of old PRs / issues to revisit in the medium-term. I'm also considering adding some instrumentation to DAGScheduler to make this type of blocking / slowdown easier to discover; see https://issues.apache.org/jira/browse/SPARK-8344

@JoshRosen
Contributor

In #7002, I added message processing time metrics to DAGScheduler using Codahale metrics, so it should now be much easier to benchmark this.
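
For readers unfamiliar with that style of instrumentation, here is a minimal sketch of how a Codahale timer measures processing time (illustrative only; the names are made up, and this is not the actual code from #7002):

```scala
import com.codahale.metrics.MetricRegistry

object MessageTimingSketch {
  private val registry = new MetricRegistry()
  private val messageTimer = registry.timer("messageProcessingTime")

  def handleMessage(process: () => Unit): Unit = {
    val context = messageTimer.time()  // start the Codahale timer
    try {
      process()                        // e.g. handle a JobSubmitted event
    } finally {
      context.stop()                   // record the elapsed time
    }
  }
}
```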

@andrewor14
Contributor

@markhamstra @kayousterhout could you have a look?

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
