@CodingCat
Contributor

https://issues.apache.org/jira/browse/SPARK-2039

apply output dir existence checking for all output formats

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15793/

@pwendell
Contributor

I think the title should say: SPARK-2039 instead of SPARK-2309

@pwendell
Contributor

LGTM - I can merge this and I'll just fix the title.

@asfgit closed this in 716c88a Jun 16, 2014

@CodingCat
Contributor Author

@pwendell, ah, sorry for the mistake.

Thanks for fixing this.

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
https://issues.apache.org/jira/browse/SPARK-2039

apply output dir existence checking for all output formats

Author: CodingCat <[email protected]>

Closes apache#1088 from CodingCat/SPARK-2039 and squashes the following commits:

c52747a [CodingCat] apply output dir existence checking for all output formats
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
asfgit pushed a commit that referenced this pull request Jan 5, 2015
This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.

Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2039 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.
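
To make the check described above concrete, here is a minimal sketch in plain Scala with no Spark or Hadoop dependency. It is not the code from any of these patches: `OutputSpecCheck`, `OutputAlreadyExistsException`, and the `validate` parameter are hypothetical names, with `validate` standing in for the SparkConf switch added by SPARK-1677.

```scala
import java.nio.file.{Files, Path}

// Hypothetical stand-in for the exception a real OutputFormat would raise.
final class OutputAlreadyExistsException(msg: String) extends RuntimeException(msg)

object OutputSpecCheck {
  // Mimics the existence check: refuse to write into a directory that is already there.
  def checkOutputSpecs(outputDir: Path): Unit =
    if (Files.exists(outputDir))
      throw new OutputAlreadyExistsException(s"Output directory $outputDir already exists")

  // `validate` plays the role of the configuration flag that disables the check.
  def save(outputDir: Path, lines: Seq[String], validate: Boolean = true): Unit = {
    if (validate) checkOutputSpecs(outputDir)
    Files.createDirectories(outputDir)
    // Default open options create the file or truncate an existing one.
    Files.write(outputDir.resolve("part-00000"), lines.mkString("\n").getBytes("UTF-8"))
  }
}
```

With validation on, a second `save` to the same directory throws; with `validate = false` it overwrites the earlier output.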

In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, repeated calls to these actions will fail because the output already exists.

This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler.  This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
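
The `DynamicVariable` approach can be illustrated with a small standalone sketch. This is an assumption-laden illustration rather than the patch itself: `OutputValidation`, `isEnabled`, and `withValidationDisabled` are hypothetical names, and `confEnabled` stands in for the value read from SparkConf.

```scala
import scala.util.DynamicVariable

object OutputValidation {
  // Hypothetical flag: true means "bypass validation" inside the current dynamic scope.
  private val disabled = new DynamicVariable[Boolean](false)

  // Validation runs only if the configuration allows it and no bypass is in effect.
  def isEnabled(confEnabled: Boolean): Boolean = confEnabled && !disabled.value

  // Runs `body` with the bypass set, then restores the previous value automatically.
  def withValidationDisabled[T](body: => T): T = disabled.withValue(true)(body)
}
```

A scheduler could wrap job submission in `withValidationDisabled { ... }`; code outside that block still sees validation enabled, and the configuration object is never mutated.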

Author: Josh Rosen <[email protected]>

Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:

36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.

(cherry picked from commit 939ba1f)
Signed-off-by: Tathagata Das <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 5, 2015
wangyum added a commit that referenced this pull request May 26, 2023
* HandleOuterJoinBuildSideSkew

* fix

* handleOuterJoinBuildSideSkew

* Check optimize tag

* fix

* Update SQLConf.scala

* Update SQLConf.scala
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025