@jkbradley
Member

PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs. This PR generalizes it.

For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.

I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684]. It should be useful for other iterative algorithms as well.

Changes I made:

  • Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
  • Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
  • The subclasses for Graphs and RDDs implement those abstract methods.
  • I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.
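
The split between the base class and its subclasses can be sketched roughly as follows. This is a simplified, Spark-free illustration and not the code in this PR: `FakeDataset`, `SketchCheckpointer`, and `FakeDatasetCheckpointer` are hypothetical names, and the rule of keeping the 3 most recent datasets persisted is an assumption made for the example.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an RDD-like object; it only records state flags.
final class FakeDataset(val id: Int) {
  var persisted = false
  var checkpointed = false
}

// Simplified base class: owns the bookkeeping and delegates the
// data-structure-specific operations to abstract methods.
abstract class SketchCheckpointer[T](val checkpointInterval: Int) {
  private val persistedQueue = mutable.Queue.empty[T]
  private var updateCount = 0

  protected def checkpoint(data: T): Unit
  protected def persist(data: T): Unit
  protected def unpersist(data: T): Unit

  // Call once per iteration with the newly created dataset.
  def update(newData: T): Unit = {
    persist(newData)
    persistedQueue.enqueue(newData)
    // Assumption for this sketch: keep only the 3 most recent datasets persisted.
    while (persistedQueue.size > 3) {
      unpersist(persistedQueue.dequeue())
    }
    updateCount += 1
    // Checkpoint every checkpointInterval-th update.
    if (updateCount % checkpointInterval == 0) {
      checkpoint(newData)
    }
  }
}

// A subclass supplies the operations for one concrete data structure.
class FakeDatasetCheckpointer(interval: Int)
  extends SketchCheckpointer[FakeDataset](interval) {
  override protected def checkpoint(data: FakeDataset): Unit = { data.checkpointed = true }
  override protected def persist(data: FakeDataset): Unit = { data.persisted = true }
  override protected def unpersist(data: FakeDataset): Unit = { data.persisted = false }
}
```

In this sketch, an iterative algorithm would create one checkpointer and call `update` on each new dataset it produces; removal of old checkpoint files is left out for brevity.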

To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites

CCing @andrewor14 in case there are relevant changes to checkpointing.
CCing @feynmanliang in case you're interested in learning about checkpointing.
CCing @mengxr for final OK.
Thanks all!

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38729 has finished for PR 7728 at commit 568918c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Do we need currentData at construction time? It might be cleaner to let user call update to add the initial dataset.

Member Author

The only problem with not doing that is that the type parameter has to be given explicitly to the constructor, but that's fine with me. I'll make the change.
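
To illustrate the type-parameter point, here is a minimal sketch using two hypothetical classes (`WithInitialData` and `WithoutInitialData` are made-up names, not classes from this PR). When the constructor takes an initial dataset, the compiler can infer `T` from that argument; when it does not, `T` has to be written out.

```scala
// Hypothetical variant that takes the initial data at construction time.
class WithInitialData[T](val interval: Int, initial: T) {
  private var latest: T = initial
  def update(newData: T): Unit = { latest = newData }
  def current: T = latest
}

// Hypothetical variant where the first dataset arrives through update().
class WithoutInitialData[T](val interval: Int) {
  private var latest: Option[T] = None
  def update(newData: T): Unit = { latest = Some(newData) }
  def current: Option[T] = latest
}

object InferenceDemo {
  // T is inferred as Seq[Int] from the constructor argument.
  val a = new WithInitialData(10, Seq(1, 2, 3))

  // No constructor argument mentions T, so it is given explicitly here.
  // Leaving it off would typically make the compiler infer Nothing,
  // and the update call below would then fail to type-check.
  val b = new WithoutInitialData[Seq[Int]](10)
  b.update(Seq(1, 2, 3))
}
```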

@SparkQA

SparkQA commented Jul 30, 2015

Test build #38968 has finished for PR 7728 at commit 32b23b8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Oops, I forgot to call update an extra time in the checkpointer tests after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.

@SparkQA

SparkQA commented Jul 30, 2015

Test build #39008 has finished for PR 7728 at commit d41902c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

@mengxr This should be ready for a final pass. Thanks!

@asfgit asfgit closed this in c581593 Jul 30, 2015
@mengxr
Contributor

mengxr commented Jul 30, 2015

LGTM. Merged into master. Thanks! Btw, it is not necessary to specify the item type of RDD or Graph. Checkpointing doesn't care about the item type. Maybe we can try RDD[_] and Graph[_, _], which might simplify the code a little bit (if it compiles).
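
A rough sketch of the wildcard idea, using a hypothetical `ToyRdd` class in place of Spark's `RDD` so that it compiles without Spark. Whether the same change compiles against the real checkpointer classes is not verified here.

```scala
// Hypothetical stand-in for an RDD; only the element type parameter matters here.
final class ToyRdd[T](val items: Seq[T]) {
  var checkpointed = false
  def checkpoint(): Unit = { checkpointed = true }
}

object WildcardDemo {
  // Generic version: T is carried in the signature even though it is never used.
  def checkpointTyped[T](data: ToyRdd[T]): Unit = data.checkpoint()

  // Wildcard version: accepts a ToyRdd of any element type.
  def checkpointAny(data: ToyRdd[_]): Unit = data.checkpoint()
}
```

Both methods behave the same way in this sketch; the wildcard form simply drops a type parameter that the checkpointing logic never inspects.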

@andrewor14
Contributor

@jkbradley thanks, this is actually not affected by the recent checkpointing changes since we keep the old code path. In the future you can switch to calling rdd.localCheckpoint() and suddenly everything will be a little faster.

@jkbradley jkbradley deleted the gbt-checkpoint branch December 29, 2016 22:15

4 participants