Conversation

@andrewor14
Contributor

Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing this is through rdd.checkpoint(), which is expensive because it writes to HDFS. This patch provides an alternative that truncates lineages cheaply without providing the same level of fault tolerance.

Local checkpointing writes checkpointed data to the local file system through the block manager. It is much faster than replicating to reliable storage and provides the same semantics as long as executors do not fail. It is exposed through a new operator, rdd.localCheckpoint(), and leaves the existing one unchanged. Users may even decide to combine the two and call the reliable one less frequently.

The bulk of this patch refactors the checkpointing interface to accept custom checkpointing implementations. Design doc.
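
A minimal usage sketch (assuming an existing SparkContext sc; the workload, loop bounds, and checkpoint frequency are illustrative only, not taken from this patch):

var rdd = sc.parallelize(1 to 1000)
for (i <- 1 to 20) {
  rdd = rdd.map(_ + 1)
  if (i % 5 == 0) {
    rdd.localCheckpoint()  // cheap: data goes through the block manager, not HDFS
    rdd.count()            // the first action materializes the checkpoint and truncates the lineage
  }
}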

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36749 has finished for PR 7279 at commit 1324d25.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo
Contributor

witgo commented Jul 8, 2015

This is a very cool PR.

@andrewor14
Contributor Author

Thank you @witgo :)

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36756 has finished for PR 7279 at commit 6602052.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36761 has finished for PR 7279 at commit d980757.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36815 has finished for PR 7279 at commit ee8e85e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36826 has finished for PR 7279 at commit 03f3126.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36836 has finished for PR 7279 at commit 3d4a717.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36820 has finished for PR 7279 at commit 125af6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Andrew Or added 14 commits July 8, 2015 19:59

This commit makes two classes abstract: `RDDCheckpointData` and
`CheckpointRDD`. It implements the existing fault-tolerant
checkpointing by subclassing these abstract classes. The goal
of this commit is to retain as much functionality as possible.
Much of the code is just moved from one file to another.

The following commits will add an implementation for unreliable
checkpointing. (A rough sketch of the resulting class structure
follows this list of commits.)

The parent base class was not serializable while the child is,
causing a Java invocation exception. Also, a prior code cleanup
caused an array out of bounds exception, which is now fixed
in this commit.

The write path runs a job to put each partition into the disk store,
while the read path simply reads these blocks back from the disk
store.

This commit simplifies the previous one by removing the special
`LocalCheckpointBlockId`, which is not needed if we use the
checkpoint RDD's ID instead of the parent RDD's. This allows us
to simply reuse the RDD cleanup code path, which is nice.

This commit makes each test in CheckpointSuite run twice, once
for normal checkpointing and once for local checkpointing.
This commit also fixes legitimate test failures after the
refactoring.

This commit does several things:

(1) LocalCheckpointRDD is made significantly simpler. Instead of
fetching block IDs from everyone and verifying whether the
partition indices are consecutive, we simply use the original
RDD's partition indices.

(2) Many checkpoint-related methods are now documented, and failure
conditions in local checkpointing now present more informative
error messages.

(3) General code cleanups (reordering things for readability).

@andrewor14 andrewor14 changed the title from "[WIP] [SPARK-7292] Cheap checkpointing" to "[SPARK-7292] Cheap checkpointing" on Jul 9, 2015

This augments the existing end-to-end tests in CheckpointSuite.
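
The refactoring commits above split the checkpointing interface into an abstract layer with a reliable (HDFS-backed) implementation and a new local one. The following is a rough, self-contained sketch of that shape only; it is not Spark's actual internal code, and the concrete class names are assumptions based on this thread rather than the merged implementation.

// Illustrative sketch, not Spark internals: one abstract checkpointing
// interface with two implementations that differ in where data is written.
abstract class RDDCheckpointData[T] extends Serializable {
  // Materialize the checkpoint; implementations differ in where the data goes.
  protected def doCheckpoint(): Unit
  final def checkpoint(): Unit = doCheckpoint()
}

// Existing fault-tolerant checkpointing: write partitions to a reliable
// store such as HDFS (stubbed out here).
class ReliableRDDCheckpointData[T] extends RDDCheckpointData[T] {
  protected def doCheckpoint(): Unit = println("write partitions to HDFS")
}

// New local checkpointing: persist partitions through the executors'
// block managers instead (stubbed out here).
class LocalRDDCheckpointData[T] extends RDDCheckpointData[T] {
  protected def doCheckpoint(): Unit = println("cache partitions via the block manager")
}
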
@SparkQA

SparkQA commented Jul 9, 2015

Test build #36882 has finished for PR 7279 at commit 5da18c7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Aug 2, 2015

I did a pass. One critical question is not clear to me: how does LocalRDDCheckpointData.doCheckpoint() work in a distributed manner?

@andrewor14
Contributor Author

OK, I fixed the concern with the local doCheckpoint(). That was a good catch. Please have another look.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #39406 has finished for PR 7279 at commit 3be5aea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SpecificSafeProjection extends $
    • case class FromUTCTimestamp(left: Expression, right: Expression)
    • case class ToUTCTimestamp(left: Expression, right: Expression)
    • case class DateDiff(endDate: Expression, startDate: Expression)
    • case class InitCap(child: Expression) extends UnaryExpression with ImplicitCastInputTypes

Contributor

Nit: Why would this ever fail if rdd.collect() gives the same result before and after checkpointing has occurred? If the rdd.collect() test above is sufficient, then these further tests seem superfluous.

Contributor Author

This is testing the following case:

rdd.localCheckpoint().map(...).filter(...).reduceByKey(...).first()

where the action doesn't happen immediately after the local checkpoint. This is a real case that needs to be tested because we need to look at the last RDD's ancestors to see whether they are checkpointed even if the last RDD is not.
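
For concreteness, a hedged sketch of the case being described (assuming a SparkContext sc; this is illustrative only, not the actual test in CheckpointSuite):

val base = sc.parallelize(1 to 100, 4).map(i => (i % 10, i))
base.localCheckpoint()           // an ancestor is marked for local checkpointing
val derived = base.map(identity).filter(_._2 > 5).reduceByKey(_ + _)
derived.first()                  // the action runs on a descendant RDD
// Correctness requires walking the lineage from `derived` back to `base`
// to notice that an ancestor, not the RDD being acted on, was checkpointed.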

@andrewor14
Contributor Author

retest this please

Contributor

I mentioned this in the earlier comment thread; you may have missed it, so I'm re-commenting here.

This rdd.count isn't great either. Even when the RDD is cached, it may be cached on disk or serialized in memory, in which case running a count may be costly and time-consuming, and it pretty much defeats the purpose of making this a cheap checkpoint. Also, in the majority of cases the RDD will be fully cached, in which case running this job is superfluous. The right thing to do (which isn't too hard) is to find out which partitions are missing and run the job on only those partitions:

val missingPartitionIds = rdd.partitions.filter { p =>
  !blockManagerMaster.contains(RDDBlockId(rdd.id, p.index))
}.map(_.index)

rdd.sparkContext.runJob(
  rdd,
  (tc: TaskContext, iterator: Iterator[T]) => Utils.getIteratorSize(iterator), // same as count()
  missingPartitionIds)

Contributor Author

Alright. Ideally the fix in SPARK-8582 will remove the need for this entirely, but in the meantime we'll go with your suggestion.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #1282 has finished for PR 7279 at commit 3be5aea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #1281 has finished for PR 7279 at commit 3be5aea.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #39442 has finished for PR 7279 at commit 3be5aea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1284 has finished for PR 7279 at commit 34bc059.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaKMeansExample
    • class FreqSequence[Item](val sequence: Array[Array[Item]], val freq: Long) extends Serializable
    • class PrefixSpanModel[Item](val freqSequences: RDD[PrefixSpan.FreqSequence[Item]])
    • class SpecificSafeProjection extends $
    • case class FromUTCTimestamp(left: Expression, right: Expression)
    • case class ToUTCTimestamp(left: Expression, right: Expression)
    • case class DateDiff(endDate: Expression, startDate: Expression)
    • case class InitCap(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • public final class UnsafeKVExternalSorter

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1285 has finished for PR 7279 at commit 34bc059.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FreqSequence[Item](val sequence: Array[Array[Item]], val freq: Long) extends Serializable
    • class PrefixSpanModel[Item](val freqSequences: RDD[PrefixSpan.FreqSequence[Item]])
    • public final class UnsafeKVExternalSorter

@SparkQA

SparkQA commented Aug 3, 2015

Test build #39466 has finished for PR 7279 at commit 34bc059.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1286 has finished for PR 7279 at commit 34bc059.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This proves that the test is valid!
@SparkQA

SparkQA commented Aug 3, 2015

Test build #39495 has finished for PR 7279 at commit 729600f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class RequestExecutors(appId: String, requestedTotal: Int)
    • case class KillExecutors(appId: String, executorIds: Seq[String])
    • public class JavaKMeansExample
    • class FreqSequence[Item](val sequence: Array[Array[Item]], val freq: Long) extends Serializable
    • class PrefixSpanModel[Item](val freqSequences: RDD[PrefixSpan.FreqSequence[Item]])
    • class SpecificSafeProjection extends $
    • case class FromUTCTimestamp(left: Expression, right: Expression)
    • case class ToUTCTimestamp(left: Expression, right: Expression)
    • case class DateDiff(endDate: Expression, startDate: Expression)
    • case class InitCap(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • public final class UnsafeKVExternalSorter

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1301 has finished for PR 7279 at commit 729600f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1300 has finished for PR 7279 at commit 34bc059.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1302 timed out for PR 7279 at commit 729600f after a configured wait of 175m.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 3, 2015

Test build #39536 has finished for PR 7279 at commit 729600f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1304 has finished for PR 7279 at commit 729600f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

@tdas good to go?

@tdas
Contributor

tdas commented Aug 3, 2015

Yep, LGTM. Merging this to master. Great patch!! Thanks!

asfgit closed this in b41a327 on Aug 3, 2015
andrewor14 deleted the local-checkpoint branch on August 3, 2015 at 18:27