[SPARK-8582][Core]Optimize checkpointing to avoid computing an RDD twice #9428
Conversation
Conflicts: core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala
/cc @andrewor14 @tdas
What if _cpDir exists before mkdirs() is called?
That's how it was before.
If checkpointing fails the first time, _cpDir won't be deleted. The user may then retry, so we should allow checkpointing the same RDD even if _cpDir already exists.
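The retry-friendly behavior described above — tolerating a checkpoint directory left behind by a failed attempt — can be sketched outside Spark in a few lines of Python (function and path names here are hypothetical, not Spark's API):

```python
import os
import tempfile

def ensure_checkpoint_dir(path):
    # Allow the directory to already exist so that a failed
    # checkpoint attempt can simply be retried.
    os.makedirs(path, exist_ok=True)
    return path

base = tempfile.mkdtemp()
cp = os.path.join(base, "rdd-42-checkpoint")
ensure_checkpoint_dir(cp)
ensure_checkpoint_dir(cp)  # second call is a no-op, not an error
assert os.path.isdir(cp)
```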
Test build #44863 has finished for PR 9428 at commit
/cc @JoshRosen
retest this please
Can you indent these 2 lines?
@zsxwing Looks great. All my comments are pretty minor. On second thought, local checkpointing doesn't really have this problem, so as long as we handle the reliable checkpointing case we're good. Have you had a chance to test this on a real cluster?
Test build #45393 has finished for PR 9428 at commit
Use Throwable here because it will be rethrown later. It's better to clean up for fatal errors as well.
Actually, still need to handle ControlThrowable. Updated it.
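The pattern under discussion — catch everything so a partial checkpoint gets cleaned up, then re-raise so the failure still propagates — can be sketched in Python (all names and paths hypothetical; `BaseException` plays roughly the role Scala's `Throwable` plays here, and Scala additionally has to special-case `ControlThrowable`):

```python
import os
import shutil
import tempfile

def write_checkpoint(path, fail=False):
    os.makedirs(path, exist_ok=True)
    try:
        if fail:
            raise RuntimeError("simulated write failure")
        with open(os.path.join(path, "part-00000"), "w") as f:
            f.write("data")
    except BaseException:
        # On *any* error, remove the partial checkpoint directory,
        # then re-raise so the caller still sees the original failure.
        shutil.rmtree(path, ignore_errors=True)
        raise

base = tempfile.mkdtemp()
ok_dir = os.path.join(base, "ok")
write_checkpoint(ok_dir)

bad_dir = os.path.join(base, "bad")
try:
    write_checkpoint(bad_dir, fail=True)
except RuntimeError:
    pass

assert os.path.isdir(ok_dir)        # successful output is kept
assert not os.path.exists(bad_dir)  # partial output is cleaned up
```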
Yes. Tested this PR with Streaming.
Just found a corner case for CheckpointingIterator. In this case, CheckpointingIterator.complete of parCollection will be called before lazyRDD's.
I cannot find a solution for this case, since we cannot run the CheckpointingIterator.complete calls in the correct order. Maybe we should revisit the approach of #9258.
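For context, the idea behind a checkpointing iterator is to persist each element as it is consumed and to finalize the checkpoint only once the source is exhausted; a toy Python sketch of that idea (names hypothetical, not Spark's actual `CheckpointingIterator`):

```python
class CheckpointingIterator:
    """Toy sketch: persist each element as it is consumed and mark
    the checkpoint complete once the source iterator is exhausted."""

    def __init__(self, source, sink):
        self._source = iter(source)
        self._sink = sink          # stand-in for checkpoint storage
        self.completed = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            item = next(self._source)
        except StopIteration:
            self.complete()        # only runs if fully consumed
            raise
        self._sink.append(item)
        return item

    def complete(self):
        self.completed = True

sink = []
it = CheckpointingIterator(range(3), sink)
assert list(it) == [0, 1, 2]
assert sink == [0, 1, 2] and it.completed
```

The ordering problem in the comment above follows directly from this design: `complete` fires whenever a given wrapped iterator happens to be drained, so nothing enforces a particular order of completion across RDDs.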
Test build #45431 has finished for PR 9428 at commit
Test build #45435 has finished for PR 9428 at commit
Test build #45447 has finished for PR 9428 at commit
As discussed offline, we cannot go with this approach. @zsxwing, can you close this PR for now until we decide to tackle it some other way later?
@zsxwing @andrewor14 Would either of you be able to explain briefly why this approach doesn't work?
@michaelmior please take a look at this test
@zsxwing Thanks for the pointer. It's not clear to me why this needs to be supported (and in fact the test no longer exists). However, I'm also not clear why the test fails in the first place (I compiled and ran the code), but that's probably because I'm relatively new to Scala and don't fully understand the semantics of lazy vals.
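For readers new to Scala: a `lazy val` is initialized at most once, on first access, and cached thereafter — which is why the order in which the lazy values are first forced matters in the failing test. A rough, illustrative Python analogue of that semantics (names hypothetical):

```python
class Lazy:
    """Rough analogue of a Scala `lazy val`: the thunk runs once,
    on first access, and the result is cached afterwards."""

    def __init__(self, thunk):
        self._thunk = thunk
        self._evaluated = False
        self._value = None

    def get(self):
        if not self._evaluated:
            self._value = self._thunk()
            self._evaluated = True
        return self._value

calls = []

def compute():
    calls.append("computed")
    return [1, 2, 3]

lazy_rdd = Lazy(compute)
assert calls == []                    # not yet evaluated
assert lazy_rdd.get() == [1, 2, 3]
assert lazy_rdd.get() == [1, 2, 3]
assert calls == ["computed"]          # evaluated exactly once
```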
Hello. I find this feature to be really important and I would be happy to contribute here. Even if we can't support every use case, it would already be a big win if in the majority of cases we could avoid the double computation, while raising a warning in the remaining cases that computation will happen twice. This is especially important for a use case I have where a transformation creates random numbers, so I simply can't recompute things, as the results would be different. So in my case the only option to break lineage seems to be a full write() followed by read().
A task/stage may run multiple times due to failure. Why is this not a problem for you? |
That's the reason why I want to checkpoint when they are first calculated. Further transformations use these results several times. Of course it's not a problem per se to calculate twice for the checkpoint, but doing so for 1+TB of data is nonsense and I can't cache. |
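The nondeterminism concern above can be illustrated in a few lines of Python: recomputing a random transformation yields different data, so replaying lineage is not an option, and the first materialized result must be the one that is kept (this is a sketch of the user's scenario, not Spark code):

```python
import random

def transform(xs):
    # Nondeterministic transformation: replaying it from lineage
    # would produce *different* data each time.
    return [x + random.random() for x in xs]

first = transform(range(1000))
second = transform(range(1000))
assert first != second  # recomputation does not reproduce the data

# "Checkpointing" here means keeping the first materialized result
# instead of ever recomputing it:
checkpointed = list(first)
assert checkpointed == first
```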
## What changes were proposed in this pull request?

This change adds local checkpoint support to datasets and the respective binding from the Python DataFrame API. If reliability requirements can be lowered to favor performance, as in cases of further quick transformations followed by a reliable save, localCheckpoint() fits very well. Furthermore, at the moment reliable checkpoints still incur double computation (see apache#9428). In general it makes the API more complete as well.

## How was this patch tested?

Python land quick use case:

```python
>>> from time import sleep
>>> from pyspark.sql import types as T
>>> from pyspark.sql import functions as F
>>> def f(x):
...     sleep(1)
...     return x*2
>>> df1 = spark.range(30, numPartitions=6)
>>> df2 = df1.select(F.udf(f, T.LongType())("id"))
>>> %time _ = df2.collect()
CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms
Wall time: 12.2 s
>>> %time df3 = df2.localCheckpoint()
CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms
Wall time: 10.3 s
>>> %time _ = df3.collect()
CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
Wall time: 148 ms
>>> sc.setCheckpointDir(".")
>>> %time df3 = df2.checkpoint()
CPU times: user 4.04 ms, sys: 1.63 ms, total: 5.67 ms
Wall time: 20.3 s
```

Author: Fernando Pereira <[email protected]>

Closes apache#19805 from ferdonline/feature_dataset_localCheckpoint.
Took over #7021 and fixed the following potential issues in the previous PR: