Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Mar 21, 2021

What changes were proposed in this pull request?

This PR aims to enable spark.hadoopRDD.ignoreEmptySplits by default for Apache Spark 3.2.0.

Why are the changes needed?

Although this is a safe improvement, this hasn't been enabled by default to avoid the explicit behavior change. This PR aims to switch the default explicitly in Apache Spark 3.2.0.

Does this PR introduce any user-facing change?

Yes, the behavior change is documented.
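
Users who depend on the previous behavior can switch the flag back off. A minimal sketch (the session-builder usage and app name are illustrative; any standard way of setting Spark confs works):

```scala
import org.apache.spark.sql.SparkSession

// Restore the pre-3.2 behavior by disabling the flag explicitly.
// Equivalent at submit time: --conf spark.hadoopRDD.ignoreEmptySplits=false
val spark = SparkSession.builder()
  .appName("restore-empty-splits") // hypothetical app name
  .config("spark.hadoopRDD.ignoreEmptySplits", "false")
  .getOrCreate()
```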

How was this patch tested?

Pass the existing CIs.

@dongjoon-hyun
Member Author

Could you review this PR, @HyukjinKwon, @viirya, @attilapiros? Personally, I tested this in my repo first.

Member

@viirya viirya left a comment

Sounds good to me. As for the behavior change, it is now documented and applies only to 3.2.

@dongjoon-hyun
Member Author

Thank you, @viirya!

@HyukjinKwon
Member

Thanks for cc'ing me @dongjoon-hyun.
Also cc @jiangxb1987 and @cloud-fan from #19504

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40872/

@HyukjinKwon
Member

There were some PRs reverted in the past (e.g., #13181 and SPARK-15393) that make me wary, but I think the change here is fine.

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40872/

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon. Ya, I remember those commits. :)

@dongjoon-hyun
Member Author

Also, cc @mridulm since this is a behavior change.

@SparkQA

SparkQA commented Mar 21, 2021

Test build #136290 has finished for PR 31909 at commit e34c5ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The failure is unrelated to this PR:

[info] - SPARK-34757: should ignore cache for SNAPSHOT dependencies *** FAILED *** (1 second, 254 milliseconds)
[info]   0 equaled 0 (SparkSubmitUtilsSuite.scala:321)

@HyukjinKwon
Member

@bozhang2820 the test failure seems related to your PR 86ea520. Yes, I can also confirm that the test failure is not related to this PR.

@HyukjinKwon
Member

retest this please

Contributor

@attilapiros attilapiros left a comment

I have checked the existing tests:

  • spark.hadoopRDD.ignoreEmptySplits work correctly (old Hadoop API)
  • spark.hadoopRDD.ignoreEmptySplits work correctly (new Hadoop API)

So the `spark.hadoopRDD.ignoreEmptySplits=true` case is covered by unit tests, and it is pretty straightforward what happens there.
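
A minimal sketch of the behavior those tests cover (assuming a local SparkContext; the path is illustrative):

```scala
import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ignore-empty-splits-demo")
  .set("spark.hadoopRDD.ignoreEmptySplits", "true")
val sc = new SparkContext(conf)

// A 0-byte file yields a single zero-length input split.
val dir = new File("/tmp/ignore-empty-splits-demo") // hypothetical path
dir.mkdirs()
new File(dir, "empty.txt").createNewFile()

// With the flag enabled, the zero-length split is filtered out, so the
// resulting RDD has no partitions for the empty file.
val rdd = sc.textFile(dir.getAbsolutePath)
assert(rdd.partitions.isEmpty)
sc.stop()
```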

I have also checked the mailing lists for any mention of this config, and there was none (so no problems or concerns have been reported regarding it).

So, as the behaviour change is documented and the feature is solid:
LGTM

@dongjoon-hyun
Member Author

Thank you so much, @attilapiros!

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40881/

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40881/

@SparkQA

SparkQA commented Mar 21, 2021

Test build #136299 has finished for PR 31909 at commit e34c5ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 21, 2021

Merged to master for Apache Spark 3.2.0.
Thank you all!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-34809 branch March 21, 2021 21:35
@cloud-fan
Contributor

late LGTM

@mridulm
Contributor

mridulm commented Mar 25, 2021

late LGTM, thanks for working on this @dongjoon-hyun!
This will impact custom map-side joins, etc., but given that the previous behavior can be restored via the flag, it is fine.

@dongjoon-hyun
Member Author

Thank you, @mridulm and @cloud-fan.

dongjoon-hyun pushed a commit that referenced this pull request Oct 31, 2022
… `SymlinkTextInputSplit` bug

### What changes were proposed in this pull request?

This PR is a follow-up for #31909. In the original PR, `spark.hadoopRDD.ignoreEmptySplits` was enabled by default because it appeared to have no side effects; however, the change breaks `SymlinkTextInputFormat`, so any table that uses that input format returns empty results.

This is due to a combination of problems:
1. The incorrect implementation of [SymlinkTextInputSplit](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java#L73): the input format does not set the `start` and `length` fields from the target split. `SymlinkTextInputSplit` is an abstraction over `FileSplit`, and all downstream systems treat it as such; those fields should be extracted from the target split and passed through.
2. `spark.hadoopRDD.ignoreEmptySplits` being enabled causes `HadoopRDD` to filter out all of the empty splits, which does not work for `SymlinkTextInputFormat` because of (1): since we never set a length (or start), those splits are considered empty and are removed from the final list of partitions, even though the target splits themselves are non-empty. See the sketch below.
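
A sketch of the interaction (not the literal `HadoopRDD` source; the helper name is made up):

```scala
import org.apache.hadoop.mapred.InputSplit

// With spark.hadoopRDD.ignoreEmptySplits=true, splits are kept or dropped
// purely on the length they report.
def keepNonEmpty(splits: Array[InputSplit]): Array[InputSplit] =
  splits.filter(_.getLength > 0)

// SymlinkTextInputSplit never copies start/length from its target split,
// so getLength returns 0, every split is dropped, and the scan returns
// no rows even though the target files contain data.
```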

Technically, this needs to be addressed in Hive, but I figured it would be much faster to fix it in Spark.

The PR introduces `DelegateSymlinkTextInputFormat`, which wraps `SymlinkTextInputFormat` and provides splits with the correct start and length attributes. This is controlled by `spark.sql.hive.useDelegateForSymlinkTextInputFormat`, which is enabled by default. When it is disabled, the user-provided `SymlinkTextInputFormat` is used instead.
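
For example, the fallback can be selected by disabling the flag (a sketch; assuming the conf can be set on a live session, otherwise set it at session creation):

```scala
// Disable the delegate to fall back to the user-provided
// SymlinkTextInputFormat.
spark.conf.set("spark.sql.hive.useDelegateForSymlinkTextInputFormat", "false")
```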

### Why are the changes needed?

Fixes a correctness issue when using `SymlinkTextInputSplit` in Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test that reproduces the issue and verified that it passes with the fix.

Closes #38277 from sadikovi/fix-symlink-input-format.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
… `SymlinkTextInputSplit` bug
