[SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dataset API #19805
Changes from all commits
abe03ab
aa70ce6
59b5562
c5f1b2c
54b7f33
c743c34
9beb375
da34c4a
7532fc4
45f4bf5
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -527,7 +527,7 @@ class Dataset[T] private[sql]( | |
| */ | ||
| @Experimental | ||
| @InterfaceStability.Evolving | ||
| - def checkpoint(): Dataset[T] = checkpoint(eager = true) | ||
| + def checkpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = true) | ||
|
|
||
| /** | ||
| * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the | ||
|
|
@@ -540,9 +540,52 @@ class Dataset[T] private[sql]( | |
| */ | ||
| @Experimental | ||
| @InterfaceStability.Evolving | ||
| - def checkpoint(eager: Boolean): Dataset[T] = { | ||
| + def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, reliableCheckpoint = true) | ||
|
|
||
| /** | ||
| * Eagerly locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be | ||
| * used to truncate the logical plan of this Dataset, which is especially useful in iterative | ||
| * algorithms where the plan may grow exponentially. Local checkpoints are written to executor | ||
| * storage and, although potentially faster, they are unreliable and may compromise job completion. | ||
| * | ||
| * @group basic | ||
|
Member
add |
||
| * @since 2.3.0 | ||
| */ | ||
| @Experimental | ||
| @InterfaceStability.Evolving | ||
| def localCheckpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = false) | ||
|
|
||
| /** | ||
| * Locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be used to truncate | ||
| * the logical plan of this Dataset, which is especially useful in iterative algorithms where the | ||
| * plan may grow exponentially. Local checkpoints are written to executor storage and, although | ||
| * potentially faster, they are unreliable and may compromise job completion. | ||
| * | ||
| * @group basic | ||
| * @since 2.3.0 | ||
| */ | ||
| @Experimental | ||
| @InterfaceStability.Evolving | ||
| def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint( | ||
| eager = eager, | ||
| reliableCheckpoint = false | ||
| ) | ||
|
|
||
| /** | ||
| * Returns a checkpointed version of this Dataset. | ||
| * | ||
| * @param eager Whether to checkpoint this dataframe immediately | ||
| * @param reliableCheckpoint Whether to create a reliable checkpoint saved to files inside the | ||
| * checkpoint directory. If false creates a local checkpoint using | ||
| * the caching subsystem | ||
| */ | ||
| private def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T] = { | ||
| val internalRdd = queryExecution.toRdd.map(_.copy()) | ||
| - internalRdd.checkpoint() | ||
| if (reliableCheckpoint) { | ||
| internalRdd.checkpoint() | ||
| } else { | ||
| internalRdd.localCheckpoint() | ||
|
Member
Could you also issue a logWarning message here to indicate that the checkpoint is not reliable? This call is a potential issue for users running on AWS EC2 Spot instances.
Contributor
Author
Hi. Thanks for the review.
Member
cc @zsxwing |
||
| } | ||
|
|
||
| if (eager) { | ||
| internalRdd.count() | ||
|
|
||
Could you check the test case of def checkpoint? At least we need to add a test case.

I can try to create a test for localCheckpoint based on the one for checkpoint, but I'm not very familiar with Scala or the Spark Scala API, so I don't currently feel comfortable writing a meaningful test. Would anybody be willing to add one?

So we already test checkpoint in DatasetSuite
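
To make the behaviour under discussion concrete, here is a minimal, Spark-free sketch of the dispatch performed by the private checkpoint(eager, reliableCheckpoint) helper in the diff. It is not the DatasetSuite test and does not touch Spark: RecordingRdd and CheckpointDispatchSketch are hypothetical stand-ins introduced only for illustration, and the assertions merely show the kind of property a real test might check (which checkpoint call is made, and whether an eager count follows).

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an RDD: it only records which methods were called.
final class RecordingRdd {
  val calls: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def checkpoint(): Unit = { calls += "checkpoint"; () }
  def localCheckpoint(): Unit = { calls += "localCheckpoint"; () }
  def count(): Long = { calls += "count"; 0L }
}

object CheckpointDispatchSketch {
  // Mirrors the branch structure of the helper in the diff (an assumption-level
  // model, not Spark code): choose reliable vs. local, then optionally force
  // materialization with count().
  def checkpoint(rdd: RecordingRdd, eager: Boolean, reliableCheckpoint: Boolean): List[String] = {
    if (reliableCheckpoint) rdd.checkpoint() else rdd.localCheckpoint()
    if (eager) rdd.count()
    rdd.calls.toList
  }

  def main(args: Array[String]): Unit = {
    // localCheckpoint() in the diff: eager = true, reliableCheckpoint = false
    assert(checkpoint(new RecordingRdd, eager = true, reliableCheckpoint = false) ==
      List("localCheckpoint", "count"))
    // localCheckpoint(eager = false): no count is triggered
    assert(checkpoint(new RecordingRdd, eager = false, reliableCheckpoint = false) ==
      List("localCheckpoint"))
    // checkpoint(): behaviour is unchanged, still the reliable path
    assert(checkpoint(new RecordingRdd, eager = true, reliableCheckpoint = true) ==
      List("checkpoint", "count"))
  }
}
```

A real test would instead call localCheckpoint on an actual Dataset and compare its contents and plan with the original, along the lines of the existing checkpoint test.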