Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Mar 21, 2021

What changes were proposed in this pull request?

This PR aims to enable spark.hadoopRDD.ignoreEmptySplits by default for Apache Spark 3.2.0.

Why are the changes needed?

Although this is a safe improvement, this hasn't been enabled by default to avoid the explicit behavior change. This PR aims to switch the default explicitly in Apache Spark 3.2.0.

Does this PR introduce any user-facing change?

Yes, the behavior change is documented.
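
Users who depend on the previous behavior can switch the flag back off. A minimal sketch (the session-builder usage and app name are illustrative; any standard way of setting Spark confs works):

```scala
import org.apache.spark.sql.SparkSession

// Restore the pre-3.2 behavior by disabling the flag explicitly.
// Equivalent at submit time: --conf spark.hadoopRDD.ignoreEmptySplits=false
val spark = SparkSession.builder()
  .appName("restore-empty-splits") // hypothetical app name
  .config("spark.hadoopRDD.ignoreEmptySplits", "false")
  .getOrCreate()
```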

How was this patch tested?

Pass the existing CIs.

@dongjoon-hyun
Member Author

Could you review this PR, @HyukjinKwon, @viirya, @attilapiros? Personally, I tested this in my repo first.

Member

@viirya viirya left a comment

Sounds good to me. As for the behavior change, it is now documented and applies only to 3.2.

@dongjoon-hyun
Member Author

Thank you, @viirya!

@HyukjinKwon
Member

Thanks for cc'ing me @dongjoon-hyun.
Also cc @jiangxb1987 and @cloud-fan from #19504

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40872/

@HyukjinKwon
Member

There were some PRs reverted in the past (e.g., #13181 and SPARK-15393) that make me wary, but I think the change here is fine.

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40872/

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon. Ya, I remember those commits. :)

@dongjoon-hyun
Member Author

Also, cc @mridulm since this is a behavior change.

@SparkQA

SparkQA commented Mar 21, 2021

Test build #136290 has finished for PR 31909 at commit e34c5ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The failure is unrelated to this PR:

[info] - SPARK-34757: should ignore cache for SNAPSHOT dependencies *** FAILED *** (1 second, 254 milliseconds)
[info]   0 equaled 0 (SparkSubmitUtilsSuite.scala:321)

@HyukjinKwon
Member

@bozhang2820 the test failure seems related to your PR 86ea520. Yes, I can also confirm that the test failure is not related to this PR.

@HyukjinKwon
Member

retest this please

Contributor

@attilapiros attilapiros left a comment

I have checked the existing tests:

  • spark.hadoopRDD.ignoreEmptySplits work correctly (old Hadoop API)
  • spark.hadoopRDD.ignoreEmptySplits work correctly (new Hadoop API)

So the `spark.hadoopRDD.ignoreEmptySplits=true` case is covered by unit tests, and it is pretty straightforward what happens there.
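
A minimal sketch of the behavior those tests cover (assuming a local SparkContext; the path is illustrative):

```scala
import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ignore-empty-splits-demo")
  .set("spark.hadoopRDD.ignoreEmptySplits", "true")
val sc = new SparkContext(conf)

// A 0-byte file yields a single zero-length input split.
val dir = new File("/tmp/ignore-empty-splits-demo") // hypothetical path
dir.mkdirs()
new File(dir, "empty.txt").createNewFile()

// With the flag enabled, the zero-length split is filtered out, so the
// resulting RDD has no partitions for the empty file.
val rdd = sc.textFile(dir.getAbsolutePath)
assert(rdd.partitions.isEmpty)
sc.stop()
```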

I have also checked the mailing lists for any mention of this config, and there was none (so no problems or concerns have been reported regarding it).

So, as the behaviour change is documented and the feature is solid:
LGTM

@dongjoon-hyun
Member Author

Thank you so much, @attilapiros!

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40881/

@SparkQA

SparkQA commented Mar 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40881/

@SparkQA

SparkQA commented Mar 21, 2021

Test build #136299 has finished for PR 31909 at commit e34c5ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 21, 2021

Merged to master for Apache Spark 3.2.0.
Thank you all!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-34809 branch March 21, 2021 21:35
@cloud-fan
Contributor

late LGTM

@mridulm
Contributor

mridulm commented Mar 25, 2021

late LGTM, thanks for working on this @dongjoon-hyun!
This will impact custom map-side joins, etc., but given that the previous behavior can be restored via the flag, it is fine.

@dongjoon-hyun
Member Author

Thank you, @mridulm and @cloud-fan.

dongjoon-hyun pushed a commit that referenced this pull request Oct 31, 2022
… `SymlinkTextInputSplit` bug

### What changes were proposed in this pull request?

This PR is a follow-up for #31909. In the original PR, `spark.hadoopRDD.ignoreEmptySplits` was enabled by default because it appeared to have no side effects; however, the change breaks `SymlinkTextInputFormat`, so any table that uses that input format returns empty results.

This is due to a combination of problems:
1. The incorrect implementation of [SymlinkTextInputSplit](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java#L73): the input format does not set the `start` and `length` fields from the target split. `SymlinkTextInputSplit` is an abstraction over `FileSplit`, and all downstream systems treat it as such; those fields should be extracted from the target split and passed through.
2. `spark.hadoopRDD.ignoreEmptySplits` being enabled causes `HadoopRDD` to filter out all of the empty splits, which does not work for `SymlinkTextInputFormat` because of (1): since we never set a length (or start), those splits are considered empty and are removed from the final list of partitions, even though the target splits themselves are non-empty. See the sketch below.
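
A sketch of the interaction (not the literal `HadoopRDD` source; the helper name is made up):

```scala
import org.apache.hadoop.mapred.InputSplit

// With spark.hadoopRDD.ignoreEmptySplits=true, splits are kept or dropped
// purely on the length they report.
def keepNonEmpty(splits: Array[InputSplit]): Array[InputSplit] =
  splits.filter(_.getLength > 0)

// SymlinkTextInputSplit never copies start/length from its target split,
// so getLength returns 0, every split is dropped, and the scan returns
// no rows even though the target files contain data.
```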

Technically, this needs to be addressed in Hive, but I figured it would be much faster to fix it in Spark.

The PR introduces `DelegateSymlinkTextInputFormat`, which wraps `SymlinkTextInputFormat` and provides splits with the correct start and length attributes. This is controlled by `spark.sql.hive.useDelegateForSymlinkTextInputFormat`, which is enabled by default. When it is disabled, the user-provided `SymlinkTextInputFormat` is used instead.
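
For example, the fallback can be selected by disabling the flag (a sketch; assuming the conf can be set on a live session, otherwise set it at session creation):

```scala
// Disable the delegate to fall back to the user-provided
// SymlinkTextInputFormat.
spark.conf.set("spark.sql.hive.useDelegateForSymlinkTextInputFormat", "false")
```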

### Why are the changes needed?

Fixes a correctness issue when using `SymlinkTextInputSplit` in Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test that reproduces the issue and verified that it passes with the fix.

Closes #38277 from sadikovi/fix-symlink-input-format.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
… `SymlinkTextInputSplit` bug
