[SPARK-34809][CORE] Enable spark.hadoopRDD.ignoreEmptySplits by default #31909
Conversation
Could you review this PR, @HyukjinKwon, @viirya, @attilapiros? Personally, I tested this in my repo first.
viirya left a comment:
Sounds good to me. As for the behavior change, it is now documented and applies only to 3.2.
Thank you, @viirya!
Thanks for cc'ing me, @dongjoon-hyun.
Kubernetes integration test starting
There were some PRs reverted (e.g., #13181 and SPARK-15393) that make me wary... but I think the change here is fine...
Kubernetes integration test status failure
Thank you, @HyukjinKwon. Ya, I remember those commits. :)
Also, cc @mridulm since this is a behavior change.
Test build #136290 has finished for PR 31909 at commit
The failure is irrelevant.
@bozhang2820 the test failure seems related to your PR 86ea520.
Yeah, I can also confirm that the test failure is not related to this PR.
retest this please |
attilapiros left a comment:
I have checked the existing tests:
- spark.hadoopRDD.ignoreEmptySplits work correctly (old Hadoop API)
- spark.hadoopRDD.ignoreEmptySplits work correctly (new Hadoop API)

So spark.hadoopRDD.ignoreEmptySplits=true is covered by unit tests (a sketch of the behavior they exercise follows this comment), and it is pretty straightforward what happens there. I have also checked the mailing lists for any mention of this config, but found none (so no problems or concerns have been reported regarding it). As the behavior change is documented and the feature is solid:
LGTM
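For context on what those tests cover, here is a minimal, hypothetical sketch of the behavior they exercise: with the flag enabled, a zero-length input file contributes no partition. The directory layout, file names, and assertions below are illustrative, not the actual suite code.

```scala
// Hedged sketch of the behavior exercised by the ignoreEmptySplits tests.
// Names and setup are illustrative; this is not the actual Spark suite code.
import java.io.File
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object IgnoreEmptySplitsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("ignoreEmptySplits-sketch")
      .set("spark.hadoopRDD.ignoreEmptySplits", "true")
    val sc = new SparkContext(conf)
    try {
      // One non-empty file and one zero-length file in the same directory.
      val dir = Files.createTempDirectory("splits-sketch").toFile
      Files.write(new File(dir, "part-0").toPath, "a\nb\n".getBytes("UTF-8"))
      new File(dir, "part-1").createNewFile() // empty file -> empty split

      // minPartitions = 1 keeps the non-empty file in a single split.
      val rdd = sc.textFile(dir.getAbsolutePath, 1)

      // With the flag on, the empty split is filtered out, so only the
      // non-empty file contributes a partition.
      assert(rdd.partitions.length == 1)
      assert(rdd.collect().sorted.sameElements(Array("a", "b")))
    } finally {
      sc.stop()
    }
  }
}
```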
Thank you so much, @attilapiros!
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136299 has finished for PR 31909 at commit
Merged to master for Apache Spark 3.2.0.
late LGTM
late LGTM, thanks for working on this @dongjoon-hyun!
Thank you, @mridulm and @cloud-fan.
… `SymlinkTextInputSplit` bug

### What changes were proposed in this pull request?

This PR is a follow-up for #31909. In the original PR, `spark.hadoopRDD.ignoreEmptySplits` was enabled due to seemingly no side effects; however, this change breaks `SymlinkTextInputFormat`, so any table that uses the input format would return empty results. This is due to a combination of problems:

1. Incorrect implementation of [SymlinkTextInputSplit](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java#L73). The input format does not set the `start` and `length` fields from the target split. `SymlinkTextInputSplit` is an abstraction over FileSplit and all downstream systems treat it as such - those fields should be extracted and passed from the target split.
2. `spark.hadoopRDD.ignoreEmptySplits` being enabled causes HadoopRDD to filter out all of the empty splits, which does not work in the case of SymlinkTextInputFormat. This is due to problem 1: because we don't set any length (and start), those splits are considered empty and are removed from the final list of partitions, even though the target splits themselves are non-empty.

Technically, this needs to be addressed in Hive, but I figured it would be much faster to fix this in Spark. The PR introduces `DelegateSymlinkTextInputFormat`, which wraps SymlinkTextInputFormat and provides splits with the correct start and length attributes. This is controlled by `spark.sql.hive.useDelegateForSymlinkTextInputFormat`, which is enabled by default. When disabled, the user-provided SymlinkTextInputFormat will be used.

### Why are the changes needed?

Fixes a correctness issue when using `SymlinkTextInputSplit` in Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test that reproduces the issue and verified that it passes with the fix.

Closes #38277 from sadikovi/fix-symlink-input-format.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
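For reference, the filtering that bites here happens when `HadoopRDD` computes its partitions. A condensed sketch of that logic (paraphrased, not the verbatim Spark source) shows why a split whose `getLength` is 0 disappears:

```scala
// Condensed sketch of the empty-split filtering in HadoopRDD.getPartitions
// (paraphrased; not the verbatim Spark source).
import org.apache.hadoop.mapred.InputSplit

def selectSplits(allInputSplits: Array[InputSplit],
                 ignoreEmptySplits: Boolean): Array[InputSplit] = {
  if (ignoreEmptySplits) {
    // SymlinkTextInputSplit reports getLength == 0 because it never copies
    // start/length from its target split, so every one of its splits is
    // dropped here even though the target files are non-empty.
    allInputSplits.filter(_.getLength > 0)
  } else {
    allInputSplits
  }
}
```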
What changes were proposed in this pull request?

This PR aims to enable spark.hadoopRDD.ignoreEmptySplits by default for Apache Spark 3.2.0.

Why are the changes needed?

Although this is a safe improvement, it has not been enabled by default so far, to avoid an explicit behavior change. This PR switches the default explicitly in Apache Spark 3.2.0.

Does this PR introduce any user-facing change?

Yes, the behavior change is documented.

How was this patch tested?

Pass the existing CIs.
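As a rough illustration, the change amounts to flipping the default of the corresponding config entry in Spark's internal config definitions. A hedged sketch (paraphrased; the exact doc string and surroundings live in `org.apache.spark.internal.config`):

```scala
// Sketch of the config entry whose default this PR flips (paraphrased;
// defined inside Spark's org.apache.spark.internal.config package).
private[spark] val HADOOP_RDD_IGNORE_EMPTY_SPLITS =
  ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
    .doc("When true, HadoopRDD and NewHadoopRDD will not create partitions " +
      "for empty input splits.")
    .booleanConf
    .createWithDefault(true) // previously false
```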