[SPARK-8582][Core] Add CheckpointingIterator to optimize checkpointing #7021

viirya · 2015-06-25T16:53:23Z

JIRA: https://issues.apache.org/jira/browse/SPARK-8582

SparkQA · 2015-06-25T17:02:34Z

Test build #35794 has finished for PR 7021 at commit d863516.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-25T18:56:10Z

Test build #35796 has finished for PR 7021 at commit 1a3055e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2015-06-26T09:54:57Z

core/src/main/scala/org/apache/spark/rdd/PartitionerAwareUnionRDD.scala

The partitioners of the RDDs might have different numPartitions, which will cause errors later.

SparkQA · 2015-06-26T11:45:38Z

Test build #35854 has finished for PR 7021 at commit 3c5b203.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2015-06-26T12:27:17Z

retest this please.

SparkQA · 2015-06-26T14:56:48Z

Test build #35859 has finished for PR 7021 at commit 3c5b203.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2015-06-26T15:00:49Z

Unrelated failure again. Looks like Jenkins is unstable now?

viirya · 2015-06-26T15:00:55Z

retest this please.

SparkQA · 2015-06-26T17:29:14Z

Test build #35862 has finished for PR 7021 at commit 3c5b203.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-07-02T00:40:38Z

@viirya Thanks for working on this. Could you add some tests for this new iterator? In particular, we should have a test that fails before but no longer fails afterwards.

andrewor14 · 2015-07-02T00:41:21Z

core/src/main/scala/org/apache/spark/util/CheckpointingIterator.scala

these should all be indented two spaces

viirya · 2015-07-03T16:18:29Z

@andrewor14 Thanks. I have added a few tests for the new iterator. The other comments are addressed too.

SparkQA · 2015-07-03T16:21:09Z

Test build #36508 has finished for PR 7021 at commit a829a7d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-03T18:35:39Z

Test build #36509 has finished for PR 7021 at commit 2f43ff3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-07-06T19:02:13Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

this could be

    checkpointData
      .map(_.getCheckpointIterator(iter, context, split.index))
      .getOrElse(iter)
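
For readers unfamiliar with the idiom, here is a minimal sketch of why the `map(...).getOrElse(...)` form is equivalent to an explicit pattern match on the `Option`. `FakeCheckpointData` and its two-argument `getCheckpointIterator` are hypothetical stand-ins invented for this illustration; they are not the class or signature from the patch.

```scala
object OptionIdiomSketch {
  // Hypothetical stand-in for RDDCheckpointData; not Spark's real class or signature.
  final class FakeCheckpointData {
    var wrapped = 0 // counts how many iterators were handed to this object

    def getCheckpointIterator[T](iter: Iterator[T], partitionIndex: Int): Iterator[T] = {
      wrapped += 1
      iter // a real implementation would return a wrapping iterator here
    }
  }

  // Verbose form: explicit pattern match on the Option.
  def verbose[T](cp: Option[FakeCheckpointData], iter: Iterator[T], idx: Int): Iterator[T] =
    cp match {
      case Some(data) => data.getCheckpointIterator(iter, idx)
      case None => iter
    }

  // Suggested form: map over the Option and fall back to the plain iterator.
  def concise[T](cp: Option[FakeCheckpointData], iter: Iterator[T], idx: Int): Iterator[T] =
    cp.map(_.getCheckpointIterator(iter, idx)).getOrElse(iter)
}
```

Both functions return the wrapped iterator when checkpoint data is present and the untouched iterator otherwise; the concise form simply avoids the boilerplate of the match.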

andrewor14 · 2015-07-06T19:07:07Z

core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala

shouldn't this read from RDDCheckpointData.rddCheckpointDataPath?

actually, the path here should just be checkpointPath. Right now this duplicates some code.

andrewor14 · 2015-07-06T23:57:45Z

@viirya Thanks for tackling this issue, but I believe the existing implementation is not fully correct.

There are two high-level problems. First, if the checkpointing iterator is not fully consumed by the user, then we end up checkpointing only a subset of the computed data. I think we should ensure that the iterator is fully drained before we can safely truncate the RDD's lineage through rdd.markCheckpointed.

Second, the state transition from Initialized -> CheckpointingInProgress -> Checkpointed is not respected. In the new model, we should transition into CheckpointingInProgress as soon as the iterator is returned so multiple calls to it will not lead to the RDD being checkpointed many times. Then only after we fully iterate through the iterator can we declare the RDD as Checkpointed.

I actually don't have a great idea on how to fix the first issue, however. We do not really have any visibility into how the higher-level caller will use the iterator, and if we consume it eagerly ourselves then the application might fail. @tdas this seems like a fundamentally difficult problem.
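
To make the two problems concrete, the following is a rough, single-threaded sketch of the state transition described above. `CheckpointTracker`, `wrap`, and the in-memory buffer standing in for the checkpoint file are all hypothetical names made up for this illustration; this is not the code in the patch or in Spark.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.collection.mutable.ArrayBuffer

object CheckpointStateSketch {
  // States named after the transition described in the comment above.
  sealed trait State
  case object Initialized extends State
  case object CheckpointingInProgress extends State
  case object Checkpointed extends State

  // Hypothetical per-partition tracker; not Spark's RDDCheckpointData.
  final class CheckpointTracker[T] {
    private val state = new AtomicReference[State](Initialized)
    private val written = ArrayBuffer.empty[T] // stands in for a checkpoint file

    def currentState: State = state.get()
    def writtenSoFar: Vector[T] = written.toVector

    // Only the first caller gets a checkpointing iterator; later callers get
    // the plain iterator, so the data is not checkpointed more than once.
    def wrap(iter: Iterator[T]): Iterator[T] =
      if (state.compareAndSet(Initialized, CheckpointingInProgress)) {
        new Iterator[T] {
          def hasNext: Boolean = {
            val more = iter.hasNext
            // Only a fully drained iterator may complete the checkpoint.
            if (!more) state.compareAndSet(CheckpointingInProgress, Checkpointed)
            more
          }
          def next(): T = {
            val elem = iter.next()
            written += elem
            elem
          }
        }
      } else {
        iter
      }
  }
}
```

In this sketch, a caller that reads only part of the iterator leaves the tracker in `CheckpointingInProgress` with a partial buffer, which is exactly the situation in which truncating the lineage would be unsafe.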

andrewor14 · 2015-07-07T00:01:22Z

Ah, one thing we could do is the following: in doCheckpoint, we check if the iterator still has values. If it does, then just keep calling next until it is fully drained. This ensures the RDD will always be fully checkpointed after an action.
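
A minimal sketch of that draining step might look like the following. The helper names `drain` and `recording` are hypothetical, the `ArrayBuffer` sink is only a stand-in for the checkpoint writer, and the real `doCheckpoint` is not shown here.

```scala
import scala.collection.mutable.ArrayBuffer

object DrainSketch {
  // Hypothetical recording wrapper: every element pulled through it is also
  // appended to `sink`, which stands in for the checkpoint file.
  def recording[T](iter: Iterator[T], sink: ArrayBuffer[T]): Iterator[T] =
    iter.map { elem =>
      sink += elem
      elem
    }

  // Consumes whatever the caller left unread and returns how many elements
  // were drained. Calling it on an exhausted iterator is a no-op.
  def drain[T](iter: Iterator[T]): Int = {
    var drained = 0
    while (iter.hasNext) {
      iter.next()
      drained += 1
    }
    drained
  }
}
```

If the action read only the first element, calling `drain` afterwards pulls the remaining elements through the recording wrapper, so the sink ends up holding the full partition. The cost is that elements the user never asked for still get computed.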

andrewor14 · 2015-07-07T00:39:24Z

@viirya by the way, I'm currently working on a major refactoring of all of this code in parallel. There will likely be a lot of conflicts to resolve at this rate. If you prefer, I could take up this issue and use your patch as a basis. In the release we'll be sure to give you credit for this fix. What do you think?

viirya · 2015-07-07T01:14:31Z

@andrewor14 no problem, thanks.

andrewor14 · 2015-07-09T19:23:50Z

As discussed, I have opened a patch, #7279, that refactors all of this. After that one is merged, I'll fix SPARK-8582 in the refactored code based on the changes here. @viirya would you mind closing this then? Thanks for your time.

Add CheckpointingIterator to optimize checkpointing.

d863516

Fix scala style.

1a3055e

Write checkpoint data to disk if it is at the end of iterator.

3c5b203

Merge remote-tracking branch 'upstream/master' into optimize_checkpoint

a829a7d

Conflicts: core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala

Fix scala style.

2f43ff3

viirya closed this Jul 15, 2015

zsxwing mentioned this pull request Oct 24, 2015

[SPARK-8582][Core]Optimize checkpointing to avoid computing an RDD twice #9258

Closed

zsxwing mentioned this pull request Nov 3, 2015

[SPARK-8582][Core]Optimize checkpointing to avoid computing an RDD twice #9428

Closed

viirya deleted the optimize_checkpoint branch December 27, 2023 18:17
