Conversation

@viirya (Member) commented Jun 25, 2015

@SparkQA commented Jun 25, 2015

Test build #35794 has finished for PR 7021 at commit d863516.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 25, 2015

Test build #35796 has finished for PR 7021 at commit 1a3055e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented:

The partitioners of the RDDs might have different numPartitions, which will cause errors later.
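(A hedged illustration of the concern above. `SimplePartitioner`, `HashLike`, and `commonPartitioner` are invented names for this sketch, not Spark's API: a shared partitioner only makes sense when every input agrees on the partition count.)

```scala
// Sketch: only reuse a shared partitioner when every defined
// partitioner agrees on numPartitions; otherwise return None so the
// caller can repartition instead of failing later.
trait SimplePartitioner { def numPartitions: Int }

case class HashLike(numPartitions: Int) extends SimplePartitioner

def commonPartitioner(ps: Seq[Option[SimplePartitioner]]): Option[SimplePartitioner] = {
  val defined = ps.flatten
  defined.headOption.filter(first => defined.forall(_.numPartitions == first.numPartitions))
}
```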

@SparkQA commented Jun 26, 2015

Test build #35854 has finished for PR 7021 at commit 3c5b203.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jun 26, 2015

retest this please.

@SparkQA commented Jun 26, 2015

Test build #35859 has finished for PR 7021 at commit 3c5b203.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jun 26, 2015

Unrelated failure again. Looks like Jenkins is unstable now?

@viirya (Member, Author) commented Jun 26, 2015

retest this please.

@SparkQA commented Jun 26, 2015

Test build #35862 has finished for PR 7021 at commit 3c5b203.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor) commented:

@viirya Thanks for working on this. Could you add some tests for this new iterator? In particular, we should have a test that fails before but no longer fails afterwards.

@andrewor14 (Contributor) commented:

These should all be indented two spaces.

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala
@viirya (Member, Author) commented Jul 3, 2015

@andrewor14 Thanks. I have added a few tests for the new iterator. The other comments are addressed too.

@SparkQA commented Jul 3, 2015

Test build #36508 has finished for PR 7021 at commit a829a7d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 3, 2015

Test build #36509 has finished for PR 7021 at commit 2f43ff3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor) commented:

This could be:

checkpointData
  .map(_.getCheckpointIterator(iter, context, split.index))
  .getOrElse(iter)

@andrewor14 (Contributor) commented:

Shouldn't this read from RDDCheckpointData.rddCheckpointDataPath?

@andrewor14 (Contributor) commented:

Actually, the path here should just be checkpointPath. Right now this duplicates some code.

@andrewor14 (Contributor) commented:

@viirya Thanks for tackling this issue, but I believe the existing implementation is not fully correct.

There are two high level problems: First, if the checkpointing iterator is not fully consumed by the user, then we end up checkpointing only a subset of the computed data. I think we should ensure that the iterator is fully drained before we can safely truncate the RDD's lineage through rdd.markCheckpointed.

Second, the state transition from Initialized -> CheckpointingInProgress -> Checkpointed is not respected. In the new model, we should transition into CheckpointingInProgress as soon as the iterator is returned so multiple calls to it will not lead to the RDD being checkpointed many times. Then only after we fully iterate through the iterator can we declare the RDD as Checkpointed.

I actually don't have a great idea on how to fix the first issue, however. We do not really have any visibility into how the higher level caller will use the iterator, and if we consume it eagerly ourselves then the application might fail. @tdas this seems like a fundamentally difficult problem.
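(A minimal sketch of the state-transition idea described in this comment, as a plain Scala iterator wrapper. `write` and `onComplete` are hypothetical stand-ins for writing checkpoint files and for rdd.markCheckpointed; this is not Spark's actual implementation.)

```scala
// Wrap an iterator so every consumed element is also written to
// checkpoint storage, and fire a completion callback only once the
// underlying input is fully drained.
class CheckpointingIterator[T](
    underlying: Iterator[T],
    write: T => Unit,       // hypothetical: write one element to the checkpoint
    onComplete: () => Unit  // hypothetical: stand-in for rdd.markCheckpointed
  ) extends Iterator[T] {

  private var completed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !completed) {
      completed = true
      onComplete() // declare Checkpointed only after full iteration
    }
    more
  }

  override def next(): T = {
    val elem = underlying.next()
    write(elem) // checkpoint each element as it is consumed
    elem
  }
}
```

The `completed` flag ensures the callback fires once even if hasNext is called repeatedly after exhaustion; the remaining gap is exactly the one noted above, when the caller never drains the iterator.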

@andrewor14 (Contributor) commented:

Ah, one thing we could do is the following: in doCheckpoint, we check if the iterator still has values. If it does, then just keep calling next until it is fully drained. This ensures the RDD will always be fully checkpointed after an action.
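(The drain idea could be sketched as follows; `drain` is a hypothetical helper, not Spark's actual doCheckpoint.)

```scala
// If the user stopped iterating early, keep calling next() until the
// iterator is exhausted, so that in a model where next() also writes
// to the checkpoint, every element still reaches the checkpoint file.
def drain[T](iter: Iterator[T]): Unit = {
  while (iter.hasNext) {
    iter.next()
  }
}
```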

@andrewor14 (Contributor) commented:

@viirya by the way, I'm currently working on a major refactoring of all of this code in parallel. There will likely be a lot of conflicts to resolve at this rate. If you prefer, I could take up this issue and use your patch as a basis. In the release we'll be sure to give you credit for this fix. What do you think?

@viirya (Member, Author) commented Jul 7, 2015

@andrewor14 no problem, thanks.

@andrewor14 (Contributor) commented:

As discussed I have opened a patch #7279 that refactors all of this. After that one is merged I'll fix SPARK-8582 in the new refactored code based on the changes here. @viirya would you mind closing this then? Thanks for your time.
