[SPARK-1855] Local checkpointing #7279
Conversation
Test build #36749 has finished for PR 7279 at commit
This is a very cool PR.
Thank you @witgo :)
Test build #36756 has finished for PR 7279 at commit
Test build #36761 has finished for PR 7279 at commit
Test build #36815 has finished for PR 7279 at commit
Test build #36826 has finished for PR 7279 at commit
Test build #36836 has finished for PR 7279 at commit
Test build #36820 has finished for PR 7279 at commit
This commit makes two classes abstract: `RDDCheckpointData` and `CheckpointRDD`. It implements the existing fault-tolerant checkpointing by subclassing these abstract classes. The goal of this commit is to retain as much functionality as possible. Much of the code is just moved from one file to another. The following commits will add an implementation for unreliable checkpointing.
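For illustration, the resulting shape might look like this minimal sketch (signatures and modifiers here are assumptions based on the description above, not the commit's exact code):

```scala
package org.apache.spark.rdd

import scala.reflect.ClassTag
import org.apache.spark.SparkContext

// Base class capturing an RDD's checkpoint state; concrete subclasses
// decide where the data goes (e.g. a reliable file system vs. local disk).
private[spark] abstract class RDDCheckpointData[T: ClassTag](rdd: RDD[T])
  extends Serializable {

  // Materialize the RDD's data and return a CheckpointRDD that reads it back.
  protected def doCheckpoint(): CheckpointRDD[T]
}

// An RDD that reads back previously checkpointed data. It declares no
// dependencies, which is what truncates the lineage.
private[spark] abstract class CheckpointRDD[T: ClassTag](sc: SparkContext)
  extends RDD[T](sc, Nil)
```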
The write path runs a job to put each partition into the disk store, while the read path simply reads these blocks back from the disk store.
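Roughly, the two paths could look like the following sketch (helper names are illustrative, not the PR's exact code):

```scala
import scala.reflect.ClassTag
import org.apache.spark.{SparkEnv, SparkException, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.{RDDBlockId, StorageLevel}

// Write path: run a job that stores each partition in the local disk store.
def writeToLocalDiskStore[T: ClassTag](rdd: RDD[T]): Unit = {
  rdd.sparkContext.runJob(rdd, (ctx: TaskContext, iter: Iterator[T]) => {
    val blockId = RDDBlockId(rdd.id, ctx.partitionId())
    SparkEnv.get.blockManager.putIterator(blockId, iter, StorageLevel.DISK_ONLY)
  })
}

// Read path: fetch a partition's block back from the block manager.
def readFromDiskStore[T](blockId: RDDBlockId): Iterator[T] = {
  SparkEnv.get.blockManager.get(blockId)
    .map(_.data.asInstanceOf[Iterator[T]])
    .getOrElse(throw new SparkException(s"Checkpoint block $blockId not found"))
}
```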
This commit simplifies the previous one in removing the special `LocalCheckpointBlockId`, which is not needed if we use the checkpoint RDD's ID instead of the parent RDD's. This allows us to simply reuse the RDD cleanup code path, which is nice.
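In other words (an assumed, simplified illustration; `checkpointRdd` and `partition` are placeholder names): each written block is keyed by the checkpoint RDD's own ID, so it looks like any other cached RDD block to the cleaner.

```scala
import org.apache.spark.storage.RDDBlockId

// Keyed by the checkpoint RDD's ID rather than the parent's, so the
// standard RDD cleanup path removes these blocks with no special casing.
val blockId = RDDBlockId(checkpointRdd.id, partition.index)
```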
This commit makes each test in CheckpointSuite run twice, once for normal checkpointing and once for local checkpointing. This commit also fixes legitimate test failures after the refactoring.
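A plausible shape for that parameterization (a sketch assuming a ScalaTest-style suite with an active SparkContext `sc`; names are illustrative):

```scala
// Generate each test body twice, once per checkpointing mode.
for (reliable <- Seq(true, false)) {
  val mode = if (reliable) "reliable" else "local"
  test(s"basic checkpointing [$mode]") {
    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    if (reliable) rdd.checkpoint() else rdd.localCheckpoint()
    // The first action materializes the checkpoint; results must be unchanged.
    assert(rdd.collect() === (1 to 100).map(_ * 2).toArray)
    assert(rdd.isCheckpointed)
  }
}
```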
This commit does several things: (1) LocalCheckpointRDD is made significantly simpler: instead of fetching block IDs from everyone and verifying whether the partition indices are contiguous, we simply use the original RDD's partition indices. (2) Many checkpoint-related methods are now documented, and failure conditions in local checkpointing now produce more informative error messages. (3) General code cleanups (reordering things for readability).
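Point (1) could be sketched as follows (an assumed, simplified shape inside the checkpoint RDD; names and `numPartitions` are illustrative):

```scala
import org.apache.spark.Partition

// Partition placeholder that simply carries over the original RDD's index.
private case class LocalCheckpointRDDPartition(index: Int) extends Partition

// Mirror the original RDD's partitioning instead of discovering block IDs
// from executors and checking that their indices are contiguous.
protected def getPartitions: Array[Partition] =
  Array.tabulate[Partition](numPartitions)(i => LocalCheckpointRDDPartition(i))
```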
This augments the existing end-to-end tests in CheckpointSuite.
Test build #36882 has finished for PR 7279 at commit
I did a pass, and there is one critical question that is not clear to me: how does LocalRDDCheckpointData.doCheckpoint() work in a distributed manner?
OK, I fixed the concern with the local
Test build #39406 has finished for PR 7279 at commit
Nit: why would this ever fail if rdd.collect() gives the same result before and after checkpointing has occurred? If the rdd.collect() test above is sufficient, then these further tests seem superfluous.
This is testing the following case:
rdd.localCheckpoint().map(...).filter(...).reduceByKey(...).first()
where the action doesn't happen immediately after the local checkpoint. This is a real case that needs to be tested because we need to look at the last RDD's ancestors to see whether they are checkpointed even if the last RDD is not.
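The check being exercised here is, conceptually, a walk up the dependency graph (a sketch, not the PR's exact code):

```scala
import org.apache.spark.rdd.RDD

// Even if the last RDD is not checkpointed itself, one of its ancestors
// may be, in which case the lineage is truncated at that ancestor.
def hasCheckpointedAncestor(rdd: RDD[_]): Boolean = {
  rdd.isCheckpointed || rdd.dependencies.exists(dep => hasCheckpointedAncestor(dep.rdd))
}
```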
retest this please
I mentioned this in the earlier comment thread; you may have missed it. Re-commenting it here.
This rdd.count isn't great either. Even when the RDD is cached, it may be cached on disk or serialized in memory, in which case running a count may be costly and time-consuming, and pretty much defeats the purpose of making this a cheap checkpointing mechanism. Also, in the majority of cases the RDD will be fully cached, in which case running this job is superfluous. The right thing to do (which isn't too hard) is to find out which partitions are missing and only run the job on those partitions:
// Only recompute the partitions whose blocks are not already in the block manager.
val missingPartitionIds = rdd.partitions.filter { p =>
  !blockManagerMaster.contains(RDDBlockId(rdd.id, p.index))
}.map { _.index }
rdd.sparkContext.runJob(
  rdd,
  (tc: TaskContext, iterator: Iterator[T]) => Utils.getIteratorSize(iterator), // same as count()
  missingPartitionIds
)
Alright. Ideally the fix in SPARK-8582 will make the need for this go away completely, but in the meantime we'll go with your suggestion.
Test build #1282 has finished for PR 7279 at commit
Test build #1281 has finished for PR 7279 at commit
Test build #39442 has finished for PR 7279 at commit
Test build #1284 has finished for PR 7279 at commit
Test build #1285 has finished for PR 7279 at commit
Test build #39466 has finished for PR 7279 at commit
Test build #1286 has finished for PR 7279 at commit
This proves that the test is valid!
Test build #39495 has finished for PR 7279 at commit
Test build #1301 has finished for PR 7279 at commit
Test build #1300 has finished for PR 7279 at commit
Test build #1302 timed out for PR 7279 at commit
retest this please
Test build #39536 has finished for PR 7279 at commit
Test build #1304 has finished for PR 7279 at commit
@tdas good to go?
Yep, LGTM. Merging this to master. Great patch!! Thanks!
Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing this is through rdd.checkpoint(), which is expensive because it writes to HDFS. This patch provides an alternative to truncate lineages cheaply without providing the same level of fault tolerance.

Local checkpointing writes checkpointed data to the local file system through the block manager. It is much faster than replicating to reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator rdd.localCheckpoint() and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently.

The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. Design doc.
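For a sense of the user-facing API, a short usage sketch (application values are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("local-checkpoint-demo").setMaster("local[2]"))

// Build up a long lineage, as iterative workloads like GraphX do.
var rdd = sc.parallelize(1 to 1000000)
for (_ <- 1 to 50) {
  rdd = rdd.map(_ + 1)
}

// Cheap lineage truncation: data goes through the block manager to local
// storage instead of HDFS, so it is fast but not tolerant of executor loss.
rdd.localCheckpoint()
rdd.count() // the first action materializes the checkpoint

// The reliable operator is unchanged and can still be combined with this:
// sc.setCheckpointDir("hdfs:///checkpoints"); rdd.checkpoint()
```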