
Conversation

@zsxwing (Member) commented Oct 24, 2015

Unlike #7021, this PR uses an approach similar to persist: it computes and checkpoints an RDD in the same job. When an RDD is computed for the first time, each partition is checkpointed and then read back as an iterator (alternatively, it could be computed again; I'm not sure which is better). After the job finishes, we also check whether all partitions have been checkpointed; if not, we still launch a job to checkpoint the missing partitions.
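To make the idea concrete, here is a minimal, self-contained sketch in plain Scala (not the actual PR code; the CheckpointWhileComputing object, the checkpointAndRead helper, and the file handling are illustrative placeholders): the partition is written to a checkpoint file during its first computation and then streamed back as a fresh iterator, since the original iterator has already been consumed by the write.

import java.io.{EOFException, File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Illustrative sketch only: write the computed partition to a checkpoint file,
// then return a new iterator that reads the data back from that file.
object CheckpointWhileComputing {
  def checkpointAndRead[T](computed: Iterator[T], file: File): Iterator[T] = {
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try computed.foreach(x => out.writeObject(x.asInstanceOf[AnyRef])) finally out.close()

    val in = new ObjectInputStream(new FileInputStream(file))
    new Iterator[T] {
      private var buffered: Option[T] = read()
      private def read(): Option[T] =
        try Some(in.readObject().asInstanceOf[T])
        catch { case _: EOFException => in.close(); None }
      def hasNext: Boolean = buffered.isDefined
      def next(): T = { val v = buffered.get; buffered = read(); v }
    }
  }

  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("partition-", ".ckpt")
    println(checkpointAndRead(Iterator("a", "b", "c"), file).toList) // List(a, b, c)
  }
}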

@zsxwing (Member, Author) commented Oct 24, 2015

I'm confused about the use of synchronized in RDDCheckpointData: since RDD is not thread-safe, why do we need to make RDDCheckpointData thread-safe?

@SparkQA commented Oct 24, 2015

Test build #44276 has finished for PR 9258 at commit c909ef0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CheckpointManager extends Logging

A contributor commented:

Just out of curiosity, is there any reason not to do an if/else here?

if (!isCheckpointedAndMaterialized && 
    checkpointData.exists(_.isInstanceOf[ReliableRDDCheckpointData[T]])) {
  SparkEnv.get.checkpointManager.getOrCompute(
    this, checkpointData.get.asInstanceOf[ReliableRDDCheckpointData[T]], split, context)
} else {
  computeOrReadCache(split, context)
}

@tdas (Contributor) commented Oct 26, 2015

@andrewor14 Can you please take a look at this?

@SparkQA commented Oct 26, 2015

Test build #44355 has finished for PR 9258 at commit 8ae42e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CheckpointManager extends Logging

@SparkQA commented Oct 26, 2015

Test build #44360 has finished for PR 9258 at commit d69f775.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 27, 2015

Test build #44417 has finished for PR 9258 at commit 824be91.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented Oct 27, 2015

retest this please

@zsxwing (Member, Author) commented Oct 27, 2015

There is still one problem in this PR. Currently, if an RDD is not persisted, it is recomputed to produce an Iterator after checkpointing (we cannot reuse the original Iterator because it has already been consumed). However, recomputing the RDD may be slower than reading back from the checkpoint file if the RDD's lineage is complicated. Since we can't know in advance which is faster, maybe we should add an option to the checkpoint API and let the user decide.
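As a purely hypothetical illustration of that option (no such parameter exists in the real checkpoint API; all names below are made up), the choice on the task side could look roughly like this:

// Hypothetical sketch of the proposed option. After the partition has been
// written to the checkpoint file, the task either replays the lineage or
// streams the file back, whichever the user declared to be cheaper.
object CheckpointReadStrategy {
  def iteratorAfterCheckpoint[T](
      recompute: () => Iterator[T],           // replay the RDD's lineage
      readCheckpointFile: () => Iterator[T],  // read the freshly written checkpoint file
      preferRecompute: Boolean): Iterator[T] =
    if (preferRecompute) recompute() else readCheckpointFile()
}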

@SparkQA commented Oct 27, 2015

Test build #44429 has finished for PR 9258 at commit 824be91.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor) commented:
retest this please

@andrewor14 (Contributor) commented:
@zsxwing I took a quick look and I have a high-level question. Why not just use a checkpointing iterator? IIUC, this approach involves reading the data back from disk to return the values. Wouldn't that be potentially expensive? Also, it doesn't fix the issue for local checkpointing.

If we have a general checkpointing iterator, then RDD doesn't have to change much and we don't need to introduce another CheckpointManager, which I find a little clunky.
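For context, a checkpointing iterator along these lines could be sketched as follows (self-contained plain Scala, not Spark's code; the class name and file handling are illustrative only): each element is written to the checkpoint file as the consumer pulls it, so the data is materialized to disk in the same pass that computes it and never has to be read back.

import java.io.{File, FileOutputStream, ObjectOutputStream}

// Illustrative sketch of a "checkpointing iterator". A real implementation
// would also need failure handling and an atomic commit of the output file.
class CheckpointingIterator[T](delegate: Iterator[T], file: File) extends Iterator[T] {
  private val out = new ObjectOutputStream(new FileOutputStream(file))
  private var closed = false

  override def hasNext: Boolean = {
    val more = delegate.hasNext
    if (!more && !closed) { out.close(); closed = true } // partition fully checkpointed
    more
  }

  override def next(): T = {
    val value = delegate.next()
    out.writeObject(value.asInstanceOf[AnyRef]) // persist the element as it is consumed
    value
  }
}

object CheckpointingIteratorDemo {
  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("checkpoint-", ".bin")
    val values = new CheckpointingIterator(Iterator("a", "b", "c"), file).toList
    println(s"$values, checkpoint file size: ${file.length()} bytes")
  }
}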

@SparkQA commented Oct 29, 2015

Test build #44590 has finished for PR 9258 at commit 824be91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented Nov 3, 2015

@andrewor14 I just opened #9428 to take over #7021 and fix potential issues in the previous PR.
