[SPARK-18191][CORE] Port RDD API to use commit protocol #15769
Conversation
Test build #68137 has finished for PR 15769 at commit
jiangxb1987 left a comment
I didn't touch SparkHadoopWriter and PairRDDFunctions.saveAsHadoopDataset, because they use the older mapred API, which the FileCommitProtocol framework doesn't currently support.
def createJobTrackerID(time: Date): String = {
  new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(time)
}
We need to generate jobTrackerID separately in SparkNewHadoopWriter.
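For illustration, a minimal sketch of generating the ID inside the writer itself, using the same format string as `createJobTrackerID` above (the object and method names here are assumptions):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

object JobTrackerIdSketch {
  // Mirrors createJobTrackerID, but lives with the new writer so the RDD write path
  // no longer depends on the old SparkHadoopWriter helper.
  def newJobTrackerId(time: Date = new Date()): String =
    new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(time)
}
```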
    (2, ArrayBuffer(1))))
  }

  test("saveNewAPIHadoopFile should call setConf if format is configurable") {
This logic is no longer needed.
  /*
    These classes are fakes for testing
    "saveNewAPIHadoopFile should call setConf if format is configurable".
    These classes are fakes for testing saveAsHadoopFile/saveNewAPIHadoopFile.
This comment has been out of date for a while.
Test build #68139 has finished for PR 15769 at commit
 * OutputCommitter, is serializable.
 */
private[spark]
class SparkNewHadoopWriter(
move this into internal/io
Doesn't need to be in this PR, but can you also implement a HadoopMapRedCommitProtocol that supports the older mapred package's committer?
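For reference, a rough sketch of what such a protocol could look like, assuming it can extend HadoopMapReduceCommitProtocol and only swap how the committer is obtained (not part of this PR, and not a definitive implementation):

```scala
import org.apache.hadoop.mapred.{JobConf, OutputCommitter}
import org.apache.hadoop.mapreduce.{TaskAttemptContext => NewTaskAttemptContext}

import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Sketch: reuse HadoopMapReduceCommitProtocol wholesale, but look up the committer from
// the old mapred JobConf instead of asking the new-API OutputFormat for one.
class HadoopMapRedCommitProtocol(jobId: String, path: String)
  extends HadoopMapReduceCommitProtocol(jobId, path) {

  override def setupCommitter(context: NewTaskAttemptContext): OutputCommitter = {
    // Assumes the configuration carried by the task attempt context is a JobConf.
    val config = context.getConfiguration.asInstanceOf[JobConf]
    config.getOutputCommitter
  }
}
```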
    // Instantiate writer
    val committer = FileCommitProtocol.instantiate(
      className = classOf[HadoopMapReduceCommitProtocol].getName,
      jobId = stageId.toString,
is this how we determine the old job id? i thought it had some date in it too
In fact I'm not sure what value should be assigned to jobId here: should it be jobTrackerId combined with jobId, or something else? I failed to find existing code to follow on this topic.
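For context, the old SparkHadoopWriter path combined a date-based jobtrackerID with a numeric id when it built the Hadoop JobID, roughly like this (a sketch; the exact wiring is an assumption):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.mapred.JobID

// Sketch: the old-style job id is a timestamp-based "job tracker" identifier plus a
// numeric id (the stage/RDD id), not the bare integer alone.
val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date())
val stageId = 0 // illustrative value
val hadoopJobId = new JobID(jobTrackerId, stageId)
```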
      jobFormat.checkOutputSpecs(job)
    }

    // Instantiate writer
do you think you can move most of the logic from saveAsNewAPIHadoopDataset into SparkNewHadoopWriter? It'd be similar to how FileFormatWriter works, but much simpler because there is no dynamic partition insert.
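Sketching what that could look like, modeled loosely on FileFormatWriter (the object name, `executeTask`, and the output-path lookup below are illustrative assumptions, not the merged code):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobID
import org.apache.hadoop.mapreduce.task.JobContextImpl

import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.internal.io.{FileCommitProtocol, HadoopMapReduceCommitProtocol}
import org.apache.spark.rdd.RDD

object SparkNewHadoopWriterSketch {

  // Hypothetical per-partition task: the real version would build a TaskAttemptContext,
  // write every record through the OutputFormat's RecordWriter, and commit the task.
  def executeTask[K, V](
      context: TaskContext,
      committer: FileCommitProtocol,
      iterator: Iterator[(K, V)]): FileCommitProtocol.TaskCommitMessage = ???

  def write[K, V: ClassTag](
      sparkContext: SparkContext,
      rdd: RDD[(K, V)],
      hadoopConf: Configuration): Unit = {
    // Driver side: derive a job id, build the JobContext and the commit protocol.
    val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date())
    val jobContext = new JobContextImpl(hadoopConf, new JobID(jobTrackerId, rdd.id))
    val committer = FileCommitProtocol.instantiate(
      className = classOf[HadoopMapReduceCommitProtocol].getName,
      jobId = rdd.id.toString,
      outputPath = hadoopConf.get("mapreduce.output.fileoutputformat.outputdir"))
    committer.setupJob(jobContext)

    try {
      // One write task per partition; each returns the commit protocol's TaskCommitMessage.
      // (The real code would ship hadoopConf wrapped in a SerializableConfiguration.)
      val ret = sparkContext.runJob(rdd,
        (context: TaskContext, iter: Iterator[(K, V)]) => executeTask(context, committer, iter))
      // Driver side again: commit the whole job with the collected task results.
      committer.commitJob(jobContext, ret)
    } catch {
      case cause: Throwable =>
        committer.abortJob(jobContext)
        throw cause
    }
  }
}
```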
Test build #68168 has finished for PR 15769 at commit

Test build #68213 has finished for PR 15769 at commit
Moved the logic from `saveAsNewAPIHadoopDataset` into `SparkNewHadoopWriter`.
Test build #68214 has finished for PR 15769 at commit
import org.apache.hadoop.mapred._
import org.apache.hadoop.mapreduce.TaskType

import org.apache.spark.internal.Logging
i'd move this file into spark/internal/io as well to be closer to SparkNewHadoopWriter.scala
we can do it in a separate pr when you do the mapred committer.
Yes I'll do it later.
 * (from the newer mapreduce API, not the old mapred API).
 */
private[spark]
object SparkNewHadoopWriter extends Logging {
would it make more sense to call it SparkHadoopMapReduceWriter to be more consistent? I understand we use "new" vs "old" in the RDD API, but that's always been fairly confusing to me.
object SparkNewHadoopWriter extends Logging {

  /** A shared job description for all the write tasks. */
  private class WriteJobDescription[K, V](
maybe just remove this, since you have only 3 items here. In SQL there were a lot of items.
  def write[K, V: ClassTag](
      sparkContext: SparkContext,
      rdd: RDD[(K, V)],
      committer: HadoopMapReduceCommitProtocol,
i'd move the creation of the commit protocol here. The reason I put it outside in SQL was because streaming and batch needed to specify different protocols, but that problem doesn't exist in core.
Test build #68215 has finished for PR 15769 at commit

Test build #68219 has finished for PR 15769 at commit
The failed test case was introduced by #15725, which is not related to our changes in this PR.
    // Try to write all RDD partitions as a Hadoop OutputFormat.
    try {
      sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
we need to collect the result coming from the commit protocol here, and pass it into commitJob, don't we?
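For illustration, the shape of that change might be as follows (names follow this diff; `executeTask` stands in for the per-partition write and is assumed to return the protocol's message):

```scala
// Sketch: runJob returns one TaskCommitMessage per partition; the driver then hands the
// collected array straight to the commit protocol when committing the job.
val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
  executeTask(context, committer, iter) // returns a FileCommitProtocol.TaskCommitMessage
})
committer.commitJob(jobContext, ret)
```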
The rest looks good. cc @ericl for another look too

retest this please.

Test build #68258 has finished for PR 15769 at commit

Test build #68265 has finished for PR 15769 at commit

I've disabled the test.

Test build #3417 has finished for PR 15769 at commit
mridulm left a comment
Overall looks good, added a few comments though.
      committer.commitJob(jobContext, ret)
      logInfo(s"Job ${jobContext.getJobID} committed.")
    } catch { case cause: Throwable =>
nit: case in a new line?
        outputMetricsAndBytesWrittenCallback.foreach { case (om, callback) =>
          om.setBytesWritten(callback())
          om.setRecordsWritten(recordsWritten)
        }
This looks like a behavior change - metrics are getting updated even when exceptions are thrown.
Do it after Utils.tryWithSafeFinallyAndFailureCallbacks completes, not in finally.
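A sketch of the suggested ordering, reusing the names visible in this diff (the write loop and the exact callback wiring are assumptions):

```scala
// Sketch: run the write-and-commit block first; only if it returns normally do we touch
// the output metrics, so an aborted task no longer reports bytes/records as written.
val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
  while (iter.hasNext) {
    val pair = iter.next()
    writer.write(pair._1, pair._2)
    recordsWritten += 1
  }
  committer.commitTask(taskContext)
}(catchBlock = {
  committer.abortTask(taskContext)
}, finallyBlock = {
  writer.close(taskContext)
})

// Reached only on success.
outputMetricsAndBytesWrittenCallback.foreach { case (om, callback) =>
  om.setBytesWritten(callback())
  om.setRecordsWritten(recordsWritten)
}
```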
    if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(sparkConf)) {
      // FileOutputFormat ignores the filesystem parameter
      val jobFormat = format.newInstance
If it is Configurable, we should invoke setConf on it - the earlier code was doing this.
The tests for that seem to have been removed as well.
Any reason why?
The logic is to create an OutputCommitter from the given hadoop conf, which is now handled by HadoopMapReduceCommitProtocol, so this code and the corresponding tests are no longer needed.
I've searched the history and failed to figure out why we were doing this...
I think I added the comment to the wrong location (though it is relevant here, it is probably less serious?).
HadoopMapReduceCommitProtocol.setupCommitter() should be doing a setConf if the OutputFormat is Configurable.
This needs to be fixed to ensure custom OutputFormats work.
I see that the PR has already been committed - can you please file a bug and fix it?
The test will also need to be re-added.
Oh...I see the problem now. Will add that in a follow up, thank you for clarifying.
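For reference, the missing piece being described is roughly this check before the committer and record writer are created (a sketch of the follow-up, not the committed fix):

```scala
import org.apache.hadoop.conf.{Configurable, Configuration}
import org.apache.hadoop.mapreduce.OutputFormat

object OutputFormatConfigSketch {
  // Sketch: push the Hadoop configuration into the OutputFormat if it is Configurable,
  // so custom formats see their configuration before producing a committer or writer.
  def initOutputFormat[K, V](format: OutputFormat[K, V], conf: Configuration): OutputFormat[K, V] = {
    format match {
      case c: Configurable => c.setConf(conf)
      case _ => // non-Configurable formats need no extra wiring
    }
    format
  }
}
```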
    }
  }

  val RECORDS_BETWEEN_BYTES_WRITTEN_METRIC_UPDATES = 256
private, and move it to the top of the object?
@mridulm I've addressed most of your comments. Thank you!

Test build #68330 has finished for PR 15769 at commit

Thanks - merging in master.
…nfigurable`.

## What changes were proposed in this pull request?

We should call `setConf` if `OutputFormat` is `Configurable`, this should be done before we create `OutputCommitter` and `RecordWriter`. This is follow up of #15769, see discussion [here](https://github.com/apache/spark/pull/15769/files#r87064229)

## How was this patch tested?

Add test of this case in `PairRDDFunctionsSuite`.

Author: jiangxingbo <[email protected]>

Closes #15823 from jiangxb1987/config-format.
## What changes were proposed in this pull request?

This PR ports the RDD API to use the commit protocol, the changes made here:

1. Add a new internal helper class, `SparkNewHadoopWriter`, that saves an RDD using a Hadoop OutputFormat; it's similar to `SparkHadoopWriter` but uses the commit protocol. This class supports the newer `mapreduce` API, instead of the old `mapred` API which is supported by `SparkHadoopWriter`;
2. Rewrite the `PairRDDFunctions.saveAsNewAPIHadoopDataset` function, so it uses the commit protocol now.

## How was this patch tested?

Existing test cases.

Author: jiangxingbo <[email protected]>

Closes apache#15769 from jiangxb1987/rdd-commit.
      hadoopConf: Configuration): Unit = {
    // Extract context and configuration from RDD.
    val sparkContext = rdd.context
    val stageId = rdd.id
Is this accurate? It seems weird that stageId is set equal to rdd.id. What is the commit protocol here?
It follows the previous behavior, what's your concern here?
We had a test failure using Spark 2.2 that seems to happen after this commit; the failure didn't occur before, with Spark 2.0. We thought there could be a problem since the jobContext passed to committer.setupJob() (line 84) has a different JobID compared to the task context that is passed to committer.setupTask() (line 126): in the former it comes from rdd.id, and in the latter it comes from stage.id. Just wanted to check: what is the protocol here that requires such a difference? Thanks.
I investigated the problem a little more and filed a JIRA and a fix; see SPARK-22162. Thank you in advance.
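For context, the invariant being discussed is that the driver-side JobContext and the per-task TaskAttemptContext should be built from the same job identifier. A minimal sketch of that invariant (the concrete ids below are illustrative):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskType}
import org.apache.hadoop.mapreduce.task.{JobContextImpl, TaskAttemptContextImpl}

// Sketch: derive one job id up front and reuse it for both the driver-side JobContext
// (setupJob) and every executor-side TaskAttemptContext (setupTask), so the committer
// sees a single consistent JobID.
val conf = new Configuration()
val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date())
val commitJobId = 0 // e.g. rdd.id; illustrative here

val jobContext = new JobContextImpl(conf, new JobID(jobTrackerId, commitJobId))
val taskAttemptId = new TaskAttemptID(jobTrackerId, commitJobId, TaskType.REDUCE, 0, 0)
val taskContext = new TaskAttemptContextImpl(conf, taskAttemptId)
```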
What changes were proposed in this pull request?
This PR ports the RDD API to use the commit protocol; the changes made here:
1. Add a new internal helper class, `SparkNewHadoopWriter`, that saves an RDD using a Hadoop OutputFormat; it's similar to `SparkHadoopWriter` but uses the commit protocol. This class supports the newer `mapreduce` API, instead of the old `mapred` API which is supported by `SparkHadoopWriter`;
2. Rewrite the `PairRDDFunctions.saveAsNewAPIHadoopDataset` function, so it uses the commit protocol now.

How was this patch tested?
Existing test cases.
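As a usage note, the user-facing API is unchanged; a save through a new-API (`mapreduce`) OutputFormat simply goes through the commit protocol under the hood now. For example (the output path and types here are illustrative):

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "commit-protocol-example")
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).map { case (k, v) =>
  (new IntWritable(k), new Text(v))
}
// saveAsNewAPIHadoopFile uses the newer mapreduce API, i.e. the code path rewritten here.
pairs.saveAsNewAPIHadoopFile[SequenceFileOutputFormat[IntWritable, Text]]("/tmp/commit-protocol-example")
sc.stop()
```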