[SPARK-18294][CORE] Implement commit protocol to support mapred package's committer
#15861
Conversation
jiangxb1987 left a comment
This PR is ready for review.
This file is moved to internal/io/
Since most of the classes in mapred extend their mapreduce counterparts, we only have to override setupCommitter to make it support the OutputCommitter from the mapred API. SparkHadoopWriter, however, requires extensive refactoring.
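A minimal sketch of that idea, using the names mentioned in this PR (the constructor arguments and the exact placement of setupCommitter are assumptions, not the final code):

```scala
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}

import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

/**
 * Commit protocol for the old mapred API. Because mapred.OutputCommitter extends
 * mapreduce.OutputCommitter, only the committer lookup needs to change.
 */
class HadoopMapRedCommitProtocol(jobId: String, path: String)
  extends HadoopMapReduceCommitProtocol(jobId, path) {

  override def setupCommitter(context: TaskAttemptContext): OutputCommitter = {
    // With the old API the task's configuration is a JobConf, which already knows
    // which OutputCommitter the job was configured with.
    val config = context.getConfiguration.asInstanceOf[JobConf]
    config.getOutputCommitter
  }
}
```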
This has been merged with the original SparkHadoopWriter; the basic workflow doesn't change, but a SparkHadoopWriterConfig class is introduced to create the output Format/Committer/Writer from a JobConf/Configuration.
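Roughly, the new class sits between the shared write path and the two Hadoop APIs. A sketch of the shape it could take (only the names that appear in this discussion - createJobContext, initOutputFormat, write, assertConf - come from the PR; the rest of the method set is illustrative):

```scala
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}

import org.apache.spark.internal.io.FileCommitProtocol

/**
 * Sketch of the abstraction: one interface, two implementations, one backed by
 * mapred's JobConf and one by mapreduce's Configuration.
 */
trait SparkHadoopWriterConfig[K, V] extends Serializable {

  // Contexts are built from the wrapped JobConf/Configuration.
  def createJobContext(jobTrackerId: String, jobId: Int): JobContext
  def createTaskAttemptContext(
      jobTrackerId: String,
      jobId: Int,
      splitId: Int,
      taskAttemptId: Int): TaskAttemptContext

  // The committer and the output format are also derived from the configuration.
  def createCommitter(jobId: Int): FileCommitProtocol
  def initOutputFormat(jobContext: JobContext): Unit

  // Per-task writer lifecycle used by the shared write loop in SparkHadoopWriter.
  def initWriter(taskContext: TaskAttemptContext, splitId: Int): Unit
  def write(pair: (K, V)): Unit
  def closeWriter(taskContext: TaskAttemptContext): Unit

  // Output-spec validation; a no-op for the new mapreduce API.
  def assertConf(): Unit
}
```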
This is moved to a separate file, with the content unchanged.
We call initOutputFormat here and encapsulate the OutputFormat inside SparkHadoopWriterConfig, because the output format classes from the mapred and mapreduce packages don't have a common superclass.
This logic is duplicated in PairRDDFunctions; let's move it here and delete it from all other places.
This class creates the output Format/Committer/Writer from a JobConf using the mapred API; most of them are obtained directly from the conf.
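A sketch of that mapred-flavoured implementation, assuming the class holds a plain JobConf (the real code would need a serializable wrapper around the conf before shipping it to tasks, and the class and method names here are illustrative):

```scala
import java.text.NumberFormat
import java.util.Locale

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.{JobConf, RecordWriter, Reporter}

import org.apache.spark.internal.io.{FileCommitProtocol, HadoopMapRedCommitProtocol}

/** Sketch: with the old API, everything is pulled straight out of the JobConf. */
class HadoopMapRedWriteConfig[K, V](conf: JobConf) {

  private var writer: RecordWriter[K, V] = _

  // The committer is the mapred-aware protocol introduced by this PR.
  def createCommitter(jobId: Int): FileCommitProtocol = {
    FileCommitProtocol.instantiate(
      className = classOf[HadoopMapRedCommitProtocol].getName,
      jobId = jobId.toString,
      outputPath = conf.get("mapred.output.dir"))
  }

  // The record writer comes from the OutputFormat configured on the JobConf.
  def initWriter(splitId: Int): Unit = {
    val outputName = "part-" + NumberFormat.getInstance(Locale.US).format(splitId.toLong)
    // FileOutputFormat ignores the FileSystem parameter, so any instance works here.
    val fs = FileSystem.get(conf)
    writer = conf.getOutputFormat
      .getRecordWriter(fs, conf, outputName, Reporter.NULL)
      .asInstanceOf[RecordWriter[K, V]]
  }

  def write(pair: (K, V)): Unit = writer.write(pair._1, pair._2)
}
```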
This supports the new mapreduce API, which creates OutputFormat from jobContext.
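And a sketch of the mapreduce-flavoured counterpart, where the OutputFormat class is looked up on the job context and instantiated reflectively (class and method names are illustrative, mirroring the sketch above):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{JobContext, OutputFormat, RecordWriter, TaskAttemptContext}

/** Sketch: with the new API, the job context knows which OutputFormat class to use. */
class HadoopMapReduceWriteConfig[K, V](conf: Configuration) {

  private var outputFormat: Class[_ <: OutputFormat[K, V]] = _
  private var writer: RecordWriter[K, V] = _

  def initOutputFormat(jobContext: JobContext): Unit = {
    if (outputFormat == null) {
      // The new API stores the output format class in the job context's configuration.
      outputFormat = jobContext.getOutputFormatClass
        .asInstanceOf[Class[_ <: OutputFormat[K, V]]]
    }
  }

  def initWriter(taskContext: TaskAttemptContext, splitId: Int): Unit = {
    // Each task instantiates its own OutputFormat and asks it for a RecordWriter.
    val format = outputFormat.getConstructor().newInstance()
    writer = format.getRecordWriter(taskContext).asInstanceOf[RecordWriter[K, V]]
  }

  def write(pair: (K, V)): Unit = writer.write(pair._1, pair._2)
}
```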
We are not supposed to create a SparkHadoopWriter inside this method; we can just create a HadoopMapRedCommitProtocol instead.
Test build #68558 has finished for PR 15861 at commit
Test build #68562 has finished for PR 15861 at commit
retest this please - looks like it has passed all the test cases but the build didn't commit.
Test build #68576 has finished for PR 15861 at commit
cc @mridulm too
do we need this? it seems like we don't want to make this configurable and there will only be two places that call this. Why not just have those two callers invoke the right constructor?
Reasonable, I'll address this.
One thing that confuses me: why is this named "Config"?
If I understand this correctly, this is basically an abstraction that makes both the old mapred API and the new mapreduce API work, isn't it?
Yes, it's an abstraction that conceals the differences between using the mapred and the mapreduce API. It is called SparkHadoopWriterConfig because we create everything from a JobConf/Configuration, but I believe there is a more concise name for it - any suggestions?
Maybe just HadoopWriteConfigUtil?
Sure - I'll update that.
Looks like the failed test suite
Test build #3423 has finished for PR 15861 at commit
Test build #68600 has finished for PR 15861 at commit
@mridulm Would you please look at this when you have time? Thank you!
Would anyone look at this PR please?
Test build #68762 has finished for PR 15861 at commit
f826a5a to bedcd10
Test build #68894 has finished for PR 15861 at commit
cc @mridulm can you take a look?
@rxin I did see this PR; unfortunately it is a bit big and I am tied up with other things - can't get to it for the next few days.
Split into multiple lines
Do we need a setupJob on the committer here?
There is a behavior change here - in earlier code, a new instance was used to check the output specs against.
Here, it is the same instance. IMO this should be fine, but I wanted to call it out in case someone has thoughts on it.
Also, why not move this into assertConf with an isOutputSpecValidationEnabled check?
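A sketch of what that suggestion could look like on the old-API side (the placement of isOutputSpecValidationEnabled in SparkHadoopWriterUtils and the exact assertConf signature here are assumptions):

```scala
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.JobConf

import org.apache.spark.SparkConf
import org.apache.spark.internal.io.SparkHadoopWriterUtils

object OutputSpecCheck {
  // Fold the output-spec check into assertConf and guard it with the validation flag.
  def assertConf(hadoopConf: JobConf, sparkConf: SparkConf): Unit = {
    if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(sparkConf)) {
      // FileOutputFormat ignores the filesystem parameter.
      val ignoredFs = FileSystem.get(hadoopConf)
      hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
    }
  }
}
```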
```scala
 * @note We should make sure our tasks are idempotent when speculation is enabled, i.e. do
 * not use output committer that writes data directly.
 * There is an example in https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
 * result of using direct output committer with speculation enabled.
```
Why was this removed? It is still relevant now, even if it is checked in a different method invoked from here.
```scala
// --------------------------------------------------------------------------

def createJobContext(jobTrackerId: String, jobId: Int): NewJobContext = {
  val jobAttemptId = new SerializableWritable(new JobID(jobTrackerId, jobId))
```
Why wrap it in SerializableWritable?
```scala
FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = jobId.toString,
  outputPath = getConf().get("mapred.output.dir"),
```
This should be org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.OUTDIR, shouldn't it?
If yes, we should have a test which fails for this case to catch future bugs.
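Applied to the snippet above, the suggestion would amount to something like this (getConf() and jobId are assumed to be the same values used in that snippet):

```scala
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// FileOutputFormat.OUTDIR is "mapreduce.output.fileoutputformat.outputdir",
// the new-API key, instead of the hard-coded old-API "mapred.output.dir".
FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = jobId.toString,
  outputPath = getConf().get(FileOutputFormat.OUTDIR))
```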
```scala
// --------------------------------------------------------------------------

def assertConf(): Unit = {
  // Do nothing for mapreduce API.
```
I see a bunch of validations being done in saveAsHadoopDataset - shouldn't they be here as well?
That includes SparkHadoopUtil.get.addCredentials, etc.
```scala
val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
  while (iterator.hasNext) {
    val pair = iterator.next()
    config.write(pair)
```
I hope this gets JIT'ed away ...
```scala
// FileOutputFormat ignores the filesystem parameter
val ignoredFs = FileSystem.get(hadoopConf)
hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
}
```
These validations should go into HadoopMapReduceWriteConfigUtil
| "ignored", pairs.keyClass, pairs.valueClass, classOf[FakeFormatWithCallback], conf) | ||
| } | ||
| assert(e.getMessage contains "failed to write") | ||
| assert(e.getCause.getMessage contains "failed to write") |
Curious, how/why did this change?
@jiangxb1987 I did a single-pass review - particularly given the similarities in both the codepaths and the classnames, I will need to go over it again to ensure we don't miss anything.
(gentle ping @jiangxb1987)
This PR should be separated into some smaller ones; I'll do this around March.
What changes were proposed in this pull request?
This PR makes the following changes:
- Implement a new commit protocol HadoopMapRedCommitProtocol which supports the old mapred package's committer;
- Refactor SparkHadoopWriter and SparkHadoopMapReduceWriter: they are now combined, so writes through both the mapred and mapreduce APIs are supported by the new SparkHadoopWriter, and a lot of duplicated code is removed;
- Move SparkHadoopWriterUtils to a separate file.

How was this patch tested?
This PR is not changing any behavior, so it is tested by the existing test cases.