[SPARK-8582][Core]Optimize checkpointing to avoid computing an RDD twice #9258
Conversation
I feel confused for
Test build #44276 has finished for PR 9258 at commit
just out of curiosity, any reason not to do an if/else here?
if (!isCheckpointedAndMaterialized &&
    checkpointData.exists(_.isInstanceOf[ReliableRDDCheckpointData[T]])) {
  SparkEnv.get.checkpointManager.getOrCompute(
    this, checkpointData.get.asInstanceOf[ReliableRDDCheckpointData[T]], split, context)
} else {
  computeOrReadCache(split, context)
}
@andrewor14 Can you please take a look at this?
Test build #44355 has finished for PR 9258 at commit
Test build #44360 has finished for PR 9258 at commit
Test build #44417 has finished for PR 9258 at commit
retest this please
There is still one problem in this PR. Currently, if an RDD is not persisted, it will be recomputed to generate an Iterator after checkpointing (we cannot reuse the original Iterator since it has already been consumed). However, recomputing the RDD may be slower than reading back from the checkpoint file if the RDD's lineage is very complicated. Since we don't know in advance whether recomputing is faster than reading from the checkpoint file, maybe we should add an option to the
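To make the trade-off above concrete, here is a minimal sketch in plain Scala. Every name in it is hypothetical (the flag, the helper functions); it only illustrates the decision that such an option would switch between: reading the freshly written checkpoint file back versus recomputing the partition.

// Hypothetical sketch: after a partition of an unpersisted RDD has been written
// to the checkpoint file, a fresh Iterator is needed for the caller.
def iteratorAfterCheckpoint[T](
    readBackFromCheckpoint: Boolean,        // would come from a (hypothetical) config option
    readCheckpointFile: () => Iterator[T],  // assumed helper: read the partition we just wrote
    recomputePartition: () => Iterator[T]   // assumed helper: re-run the partition's lineage
  ): Iterator[T] = {
  if (readBackFromCheckpoint) {
    // Cheap when the lineage is long or expensive, but pays the I/O cost a second time.
    readCheckpointFile()
  } else {
    // Avoids the extra read, but re-runs the whole lineage of the partition.
    recomputePartition()
  }
}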
Test build #44429 has finished for PR 9258 at commit
retest this please
@zsxwing I took a quick look and I have a high-level question. Why not just use a checkpointing iterator? IIUC this approach involves reading the iterator back from disk to return the values. Wouldn't that be potentially expensive? Also, this doesn't fix the problem for local checkpointing. If we had a general checkpointing iterator, then RDD wouldn't have to change much and we wouldn't need to introduce another
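For illustration, a "checkpointing iterator" along the lines suggested above could be sketched as follows. This is not code from the PR; CheckpointingIterator, writeElement, and finish are made-up names. The idea is to wrap the original iterator so each value is written to the checkpoint output as it is consumed, checkpointing the partition as a side effect of the first computation with no second pass over the data.

// Hypothetical sketch of a checkpointing iterator: forwards values from the
// underlying iterator to the caller and streams each one into the checkpoint output.
class CheckpointingIterator[T](
    underlying: Iterator[T],
    writeElement: T => Unit,  // assumed helper: serialize one value to the checkpoint stream
    finish: () => Unit        // assumed helper: flush and commit the checkpoint file
  ) extends Iterator[T] {

  private var finished = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !finished) {
      finish()                // the partition has been fully consumed, so it is fully checkpointed
      finished = true
    }
    more
  }

  override def next(): T = {
    val value = underlying.next()
    writeElement(value)       // checkpoint the value while handing it to the caller
    value
  }
}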
Test build #44590 has finished for PR 9258 at commit
@andrewor14 I just opened #9428 to take over #7021 and fix potential issues in the previous PR. Unlike #7021, this PR uses an approach similar to persist: it computes and checkpoints an RDD in the same job. When an RDD is computed for the first time, each partition is checkpointed and read back as an iterator (or perhaps computed again; I'm not sure which is better). After the job finishes, we also check whether all partitions have been checkpointed; if not, we still need to launch a job to checkpoint the missing partitions.
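As a rough illustration of the two-phase flow described above (not the PR's actual code; every name here is invented): partitions that the job computes are checkpointed on the fly, and a follow-up job covers only the partitions the original job never touched.

def finalizeCheckpoint(
    numPartitions: Int,
    isPartitionCheckpointed: Int => Boolean,  // assumed: checks the checkpoint dir for this partition
    checkpointPartitions: Seq[Int] => Unit    // assumed: stands in for running a job on these partitions
  ): Unit = {
  // Phase 1 is implicit: while the user's job runs, each computed partition is
  // written to the checkpoint directory and read back (or recomputed) as an iterator.
  // Phase 2: after the job, checkpoint only the partitions the job never computed.
  val missing = (0 until numPartitions).filterNot(isPartitionCheckpointed)
  if (missing.nonEmpty) {
    checkpointPartitions(missing)
  }
}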