squito commented Jun 4, 2015

https://issues.apache.org/jira/browse/SPARK-8029

This implements one of the approaches in the design doc on the JIRA: each ShuffleMapTask attempt now writes to a different location. ShuffleBlockId is extended to include the stage attempt id, so the fetch side knows which files to read from. MapStatus also includes the stage attempt, so there is now one MapStatus per (executor, attempt) as opposed to one per executor. This won't really matter when there is just one attempt per stage. In a pathological case, you'd end up with one MapStatus per partition, which would be much worse, but that is very unlikely.
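
To make the idea concrete, here is a minimal sketch rather than the code in this patch. The field order of `ShuffleBlockId` mirrors the signature SparkQA reports for this PR; the `name` format, the `MapOutput` stand-in, and the `statusKeys` helper are assumptions made up for illustration.

```scala
// Hypothetical sketch: these are not the classes from the patch.
object AttemptScopedShuffle {
  // Same fields as the ShuffleBlockId signature reported by SparkQA on this PR.
  final case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int) {
    // Assumed naming scheme: the stage attempt id is appended as a trailing component,
    // so two attempts of the same map task never resolve to the same block name.
    def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}_$stageAttemptId"
  }

  // Simplified stand-in for a finished map output (not Spark's MapStatus).
  final case class MapOutput(executorId: String, stageAttemptId: Int, mapId: Int)

  // One status entry per (executor, attempt) pair instead of one per executor.
  def statusKeys(outputs: Seq[MapOutput]): Set[(String, Int)] =
    outputs.map(o => (o.executorId, o.stageAttemptId)).toSet
}
```

With a single attempt per stage, `statusKeys` yields one key per executor, which matches the "won't really matter" case above; extra keys only appear when attempts overlap.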

This touches a lot of files, but almost all of the changes are just plumbing a stageAttemptId through a lot of different places.

cc @JoshRosen

squito added 30 commits May 6, 2015 19:49
…rtial fix, still have some concurrent attempts
Conflicts:
	core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
	core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Conflicts:
	core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala

super nit: comment should follow javadoc formatting:

/**
 *  Comment.
 */

vanzin commented Oct 8, 2015

Looks sane, but this isn't really my area of expertise. Just a reminder that you should either enable DAGSchedulerFailureRecoverySuite or remove it from the patch.

Also, left a question about backwards compatibility.

SparkQA commented Oct 8, 2015

Test build #43401 has finished for PR 6648 at commit f37be91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int)
    • case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)

SparkQA commented Oct 9, 2015

Test build #43470 has finished for PR 6648 at commit 37ac799.

  • This patch fails MiMa tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int)
    • case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)

squito commented Oct 9, 2015

Jenkins, retest this please

SparkQA commented Oct 9, 2015

Test build #43475 has finished for PR 6648 at commit c9a9e08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int)
    • case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • public final class UnsafeRow extends MutableRow implements Externalizable, KryoSerializable
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */

squito commented Oct 9, 2015

@vanzin @JoshRosen made external shuffle service backwards compatible and got rid of DAGSchedulerFailureRecoverySuite
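
For readers wondering what backwards compatibility could look like here, the following is a rough sketch and not taken from the patch. It assumes block names of the form `shuffle_<shuffleId>_<mapId>_<reduceId>` for older clients and a hypothetical four-part form with a trailing stage attempt id for newer ones; the `BlockIdCompat` object and the default attempt of 0 are illustrative assumptions.

```scala
// Hypothetical sketch: accepts both the legacy and the attempt-aware block name.
object BlockIdCompat {
  // Legacy three-part name, as sent by clients that know nothing about attempts.
  private val Legacy = """shuffle_(\d+)_(\d+)_(\d+)""".r
  // Assumed extended form with the stage attempt id as a fourth component.
  private val WithAttempt = """shuffle_(\d+)_(\d+)_(\d+)_(\d+)""".r

  /** Returns (shuffleId, mapId, reduceId, stageAttemptId); legacy names default to attempt 0. */
  def parse(name: String): Option[(Int, Int, Int, Int)] = name match {
    case WithAttempt(s, m, r, a) => Some((s.toInt, m.toInt, r.toInt, a.toInt))
    case Legacy(s, m, r)         => Some((s.toInt, m.toInt, r.toInt, 0))
    case _                       => None
  }
}
```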

vanzin commented Oct 9, 2015

I looked at the diffs since my last review, looks good.

rxin commented Oct 12, 2015

I will get @JoshRosen to take a look at this.

mateiz commented Oct 13, 2015

Hey Imran,

Given the number of changes required for this approach, I wonder whether an atomic rename design wouldn't be simpler (in particular, the "first attempt wins" in the doc). The doc seems to be worried that a file output might be corrupted, but in that case, why not send a message to the node asking it to delete its old output files, and then send a new map task? It can just be the delete-block message that the block manager already supports. This seems much nicer because it doesn't require any changes to the data structures in the rest of Spark.
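
A minimal sketch of the "first attempt wins" rename idea, under stated assumptions: each attempt writes to its own temporary file next to the final path, and a JVM-local lock guards the check-then-rename step. The `FirstAttemptWins` object, the temp-file naming, and the lock are illustrative and do not come from the design doc or the patch.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical sketch of "first attempt wins" via write-to-temp plus rename.
object FirstAttemptWins {
  // Guards the exists-then-move sequence. Assumption: all competing attempts
  // for a given output run inside the same JVM (e.g. the same executor).
  private val commitLock = new Object

  /** Writes `bytes` to a per-attempt temp file, then commits it unless an
    * earlier attempt already produced `dest`. Returns true if this attempt won. */
  def commit(dest: Path, attemptId: Int, bytes: Array[Byte]): Boolean = {
    val tmp = dest.resolveSibling(s"${dest.getFileName}.attempt-$attemptId.tmp")
    Files.write(tmp, bytes)
    commitLock.synchronized {
      if (Files.exists(dest)) {
        // A previous attempt already committed; discard this attempt's output.
        Files.delete(tmp)
        false
      } else {
        // Readers see either no file or the complete file, never a partial one.
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE)
        true
      }
    }
  }
}
```

Because losing attempts simply delete their temporary file, nothing outside the writer needs to know about attempt ids, which is the appeal of this design compared with threading an attempt id through the block ids.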

mateiz commented Oct 13, 2015

BTW, with that design, I also wouldn't even implement the delete message in the first patch, unless we've actually seen block corruptions happen; but it sounds like we haven't seen such things and we probably wouldn't have a great way to detect them now anyway (i.e. the reduce task would mark a fetch successful and just crash).

SparkQA commented Nov 9, 2015

Test build #45389 has finished for PR 6648 at commit fbd129b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MasterWebUI(
    • case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int)
    • case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • public class JavaAFTSurvivalRegressionExample

squito commented Nov 9, 2015

Jenkins, retest this please

SparkQA commented Nov 10, 2015

Test build #45437 has finished for PR 6648 at commit fbd129b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int)
    • case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)

squito commented Nov 10, 2015

Jenkins, retest this please

SparkQA commented Nov 10, 2015

Test build #45484 has started for PR 6648 at commit fbd129b.

squito commented Nov 10, 2015

Jenkins, retest this please

SparkQA commented Nov 10, 2015

Test build #45528 has finished for PR 6648 at commit fbd129b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int)
    • case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)

squito commented Nov 10, 2015

Jenkins, retest this please

SparkQA commented Nov 10, 2015

Test build #45533 has finished for PR 6648 at commit fbd129b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, stageAttemptId: Int)
    • case class ShuffleDataBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)
    • case class ShuffleIndexBlockId(shuffleId: Int, mapId: Int, stageAttemptId: Int, reduceId: Int)

7 participants