[SPARK-8582][Core]Optimize checkpointing to avoid computing an RDD twice #9258
Conversation
I feel confused for
Test build #44276 has finished for PR 9258 at commit
just out of curiosity, any reason not to do an if/else here?
if (!isCheckpointedAndMaterialized &&
    checkpointData.exists(_.isInstanceOf[ReliableRDDCheckpointData[T]])) {
  SparkEnv.get.checkpointManager.getOrCompute(
    this, checkpointData.get.asInstanceOf[ReliableRDDCheckpointData[T]], split, context)
} else {
  computeOrReadCache(split, context)
}
@andrewor14 Can you please take a look at this?
Test build #44355 has finished for PR 9258 at commit
Test build #44360 has finished for PR 9258 at commit
Test build #44417 has finished for PR 9258 at commit
retest this please
There is still one problem in this PR. Currently, if an RDD is not persisted, it will be recomputed to generate an Iterator after checkpointing (we cannot reuse the original Iterator since it has already been consumed). However, recomputing the RDD may be slower than reading back from the checkpoint file if the RDD's lineage is very complicated. Since we don't know in advance whether recomputing is faster than reading from the checkpoint file, maybe we should add an option to the
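To make the trade-off above concrete, here is a minimal sketch in plain Scala. Every name in it is hypothetical (the flag, the helper functions); it only illustrates the decision that such an option would switch between: reading the freshly written checkpoint file back versus recomputing the partition.

// Hypothetical sketch: after a partition of an unpersisted RDD has been written
// to the checkpoint file, a fresh Iterator is needed for the caller.
def iteratorAfterCheckpoint[T](
    readBackFromCheckpoint: Boolean,        // would come from a (hypothetical) config option
    readCheckpointFile: () => Iterator[T],  // assumed helper: read the partition we just wrote
    recomputePartition: () => Iterator[T]   // assumed helper: re-run the partition's lineage
  ): Iterator[T] = {
  if (readBackFromCheckpoint) {
    // Cheap when the lineage is long or expensive, but pays the I/O cost a second time.
    readCheckpointFile()
  } else {
    // Avoids the extra read, but re-runs the whole lineage of the partition.
    recomputePartition()
  }
}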
Test build #44429 has finished for PR 9258 at commit
retest this please
@zsxwing I took a quick look and I have a high-level question. Why not just use a checkpointing iterator? IIUC this approach involves reading the iterator back from disk to return the values. Wouldn't that be potentially expensive? Also, this doesn't fix the problem for local checkpointing. If we had a general checkpointing iterator, then RDD wouldn't have to change much and we wouldn't need to introduce another
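For illustration, a "checkpointing iterator" along the lines suggested above could be sketched as follows. This is not code from the PR; CheckpointingIterator, writeElement, and finish are made-up names. The idea is to wrap the original iterator so each value is written to the checkpoint output as it is consumed, checkpointing the partition as a side effect of the first computation with no second pass over the data.

// Hypothetical sketch of a checkpointing iterator: forwards values from the
// underlying iterator to the caller and streams each one into the checkpoint output.
class CheckpointingIterator[T](
    underlying: Iterator[T],
    writeElement: T => Unit,  // assumed helper: serialize one value to the checkpoint stream
    finish: () => Unit        // assumed helper: flush and commit the checkpoint file
  ) extends Iterator[T] {

  private var finished = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !finished) {
      finish()                // the partition has been fully consumed, so it is fully checkpointed
      finished = true
    }
    more
  }

  override def next(): T = {
    val value = underlying.next()
    writeElement(value)       // checkpoint the value while handing it to the caller
    value
  }
}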
Test build #44590 has finished for PR 9258 at commit
@andrewor14 I just opened #9428 to take over #7021 and fix potential issues in the previous PR. Unlike #7021, this PR uses an approach similar to persist: it computes and checkpoints an RDD in the same job. When an RDD is computed for the first time, each partition is checkpointed and read back as an iterator (or perhaps computed again; I'm not sure which is better). After the job finishes, we also check whether all partitions have been checkpointed; if not, we still need to launch a job to checkpoint the missing partitions.
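As a rough illustration of the two-phase flow described above (not the PR's actual code; every name here is invented): partitions that the job computes are checkpointed on the fly, and a follow-up job covers only the partitions the original job never touched.

def finalizeCheckpoint(
    numPartitions: Int,
    isPartitionCheckpointed: Int => Boolean,  // assumed: checks the checkpoint dir for this partition
    checkpointPartitions: Seq[Int] => Unit    // assumed: stands in for running a job on these partitions
  ): Unit = {
  // Phase 1 is implicit: while the user's job runs, each computed partition is
  // written to the checkpoint directory and read back (or recomputed) as an iterator.
  // Phase 2: after the job, checkpoint only the partitions the job never computed.
  val missing = (0 until numPartitions).filterNot(isPartitionCheckpointed)
  if (missing.nonEmpty) {
    checkpointPartitions(missing)
  }
}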