SPARK-1677: allow user to disable output dir existence checking #947

CodingCat · 2014-06-03T04:14:03Z

https://issues.apache.org/jira/browse/SPARK-1677

For compatibility with older versions of Spark, it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true) that lets the user disable the output directory existence check.

AmplabJenkins · 2014-06-03T04:17:58Z

Merged build triggered.

pwendell · 2014-06-03T04:18:00Z

core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

This seems a little backwards... it basically says "if (!shouldValidate) { do validation }". I think maybe this should be:

if (conf.getBoolean("spark.hadoop.validateOutputSpecs", true)...

ah...sorry,

AmplabJenkins · 2014-06-03T04:18:05Z

Merged build started.

AmplabJenkins · 2014-06-03T04:22:58Z

Merged build triggered.

AmplabJenkins · 2014-06-03T04:23:05Z

Merged build started.

pwendell · 2014-06-03T04:54:31Z

LGTM pending tests, thanks for adding this. We should put this into 1.0.1 and 1.1

AmplabJenkins · 2014-06-03T04:55:40Z

Merged build finished.

AmplabJenkins · 2014-06-03T04:55:40Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15372/

pwendell · 2014-06-03T04:59:42Z

I think these test failures are from the first version of the patch.

AmplabJenkins · 2014-06-03T05:00:51Z

Merged build finished.

AmplabJenkins · 2014-06-03T05:00:52Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15373/

pwendell · 2014-06-03T05:03:22Z

core/src/test/scala/org/apache/spark/FileSuite.scala

The conf is immutable once the SparkContext is created. You need to create a new conf first and then pass it to the SparkContext constructor.
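A minimal sketch of what this could look like, assuming a local master; the application name and the output path are hypothetical and not taken from the patch. The point is only that the option is set on a fresh `SparkConf` before the `SparkContext` is constructed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ValidateOutputSpecsExample {
  def main(args: Array[String]): Unit = {
    // Build the configuration first and set the option on it
    // before any SparkContext exists.
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("validate-output-specs-example") // hypothetical name
      .set("spark.hadoop.validateOutputSpecs", "false")

    // Pass the prepared conf to the constructor.
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("a", "b", "c"))
      // Hypothetical path; with the option set to false, Spark skips
      // the up-front output directory existence check for this save.
      lines.saveAsTextFile("/tmp/validate-output-specs-example")
    } finally {
      sc.stop()
    }
  }
}
```

A test would follow the same order: stop any existing context, build a new conf with the option disabled, then create the context from it.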

AmplabJenkins · 2014-06-03T09:27:58Z

Merged build triggered.

AmplabJenkins · 2014-06-03T09:28:04Z

Merged build started.

AmplabJenkins · 2014-06-03T10:05:57Z

Merged build finished.

AmplabJenkins · 2014-06-03T10:05:57Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15378/

AmplabJenkins · 2014-06-03T10:12:58Z

Merged build triggered.

AmplabJenkins · 2014-06-03T10:13:06Z

Merged build started.

AmplabJenkins · 2014-06-03T10:53:49Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-03T10:53:50Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15379/

CodingCat · 2014-06-03T10:54:48Z

done, @pwendell @mateiz , thanks for the comments

CodingCat · 2014-06-05T05:41:38Z

more comments?

pwendell · 2014-06-05T18:36:12Z

core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

Hey, the fact that self.conf and conf have the same name is pretty bad, and it could easily lead to issues down the road. I realize it's not part of your patch, but would you mind renaming the input to hadoopConf so that there is no overloading?

def saveAsNewAPIHadoopDataset(hadoopConf: Configuration) {

hadoopConf: JobConf = new JobConf(self.context.hadoopConfiguration),

Actually maybe that can be in a separate patch. I can merge this and we can add it later.
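To illustrate the naming concern, here is a small self-contained sketch. `SparkSettings`, `HadoopSettings`, and `Saver` are hypothetical stand-ins, not Spark or Hadoop classes; they only mimic the situation where an outer `conf` (Spark settings) and a method parameter (Hadoop settings) could share a name:

```scala
// Hypothetical stand-in for the Spark-side configuration.
final case class SparkSettings(values: Map[String, String]) {
  def getBoolean(key: String, default: Boolean): Boolean =
    values.get(key).map(_.toBoolean).getOrElse(default)
}

// Hypothetical stand-in for the Hadoop-side configuration.
final case class HadoopSettings(values: Map[String, String])

class Saver(val conf: SparkSettings) {
  // If this parameter were also named `conf`, an unqualified `conf` in the
  // body would refer to the Hadoop settings and hide the field above.
  // Naming it `hadoopConf` keeps the two configurations clearly apart.
  def shouldValidate(hadoopConf: HadoopSettings): Boolean =
    conf.getBoolean("spark.hadoop.validateOutputSpecs", true)
}

object ShadowingDemo {
  def main(args: Array[String]): Unit = {
    val saver = new Saver(
      SparkSettings(Map("spark.hadoop.validateOutputSpecs" -> "false")))
    println(saver.shouldValidate(HadoopSettings(Map.empty))) // prints false
  }
}
```

With distinct names, the reader does not have to track which `conf` is in scope to see that the validation flag comes from the Spark-side settings.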

https://issues.apache.org/jira/browse/SPARK-1677

For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true) for the user to disable the output directory existence checking

Author: CodingCat

Closes #947 from CodingCat/SPARK-1677 and squashes the following commits:

7930f83 [CodingCat] miao
c0c0e03 [CodingCat] bug fix and doc update
5318562 [CodingCat] bug fix
13219b5 [CodingCat] allow user to disable output dir existence checking

(cherry picked from commit 89cdbb0)
Signed-off-by: Patrick Wendell

pwendell · 2014-06-05T18:41:29Z

Merged into 1.0 and 1.1. I created https://issues.apache.org/jira/browse/SPARK-2039 as a follow-up to this.

CodingCat · 2014-06-06T01:28:14Z

@pwendell, thanks for merging.

Sure, I will work on this.

This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.

Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.

In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output.

This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.

Author: Josh Rosen

Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:

36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.

(cherry picked from commit 939ba1f)
Signed-off-by: Tathagata Das
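The `DynamicVariable` idea mentioned in the commit message above can be sketched roughly as follows. This is not the code from the patch: the object name `OutputSpecValidation`, the methods `withoutValidation` and `isEnabled`, and the `configured` parameter (standing in for the `spark.hadoop.validateOutputSpecs` setting) are all assumptions made for illustration.

```scala
import scala.util.DynamicVariable

object OutputSpecValidation {
  // Hypothetical flag; false by default, meaning validation is not bypassed.
  private val bypass = new DynamicVariable[Boolean](false)

  // Runs `body` with the bypass flag set. The value is thread-local and is
  // restored when `body` finishes, so no shared configuration is mutated.
  def withoutValidation[T](body: => T): T = bypass.withValue(true)(body)

  // `configured` stands in for the user's validateOutputSpecs setting.
  def isEnabled(configured: Boolean): Boolean = configured && !bypass.value
}

object DynamicVariableDemo {
  def main(args: Array[String]): Unit = {
    // Outside the bypass scope the configured value applies.
    println(OutputSpecValidation.isEnabled(configured = true)) // prints true

    // Inside the scope, validation is reported as disabled.
    val inside = OutputSpecValidation.withoutValidation {
      OutputSpecValidation.isEnabled(configured = true)
    }
    println(inside) // prints false

    // After the scope ends the previous value is back.
    println(OutputSpecValidation.isEnabled(configured = true)) // prints true
  }
}
```

The appeal of this shape is that a scheduler could wrap job submission in the bypass scope while code outside that scope keeps seeing the user's configured value.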

yangli907 · 2016-11-23T01:13:06Z

Hi Spark Community,

I'm curious about the behavior of the "spark.hadoop.validateOutputSpecs" option. If I set it to 'false', will existing files in the output directory get wiped out beforehand? For example, if a Spark job outputs file Y under directory A, which already contains file X, should we expect both files X and Y under directory A? Or will only Y be retained after the job completes?

Thanks!

pwendell reviewed Jun 3, 2014

13219b5 allow user to disable output dir existence checking

5318562 bug fix

pwendell reviewed Jun 3, 2014

c0c0e03 bug fix and doc update

7930f83 miao

pwendell mentioned this pull request Jun 4, 2014

Added java system variable spark.hadoop.checkoutputspec. Set it to false... #958

Closed

pwendell reviewed Jun 5, 2014

asfgit closed this in 89cdbb0 Jun 5, 2014

JoshRosen mentioned this pull request Dec 30, 2014

[SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs #3832

Closed

agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022

Merge pull request apache#947 from mapr/SPARK-994-MEP-8.1.0

2d93690

MapR [SPARK-994] Update jackson-mapper-asl v1.9.13 to 1.9.13-atlassian-5
