[SPARK-27291][SQL] PartitioningAwareFileIndex: Filter out empty files on listing files #24227
MaxGekk
left a comment
It looks like filtering out empty files can affect the calculation of maxSplitBytes:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
Line 92 in c0632ce
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
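To make the concern concrete, here is a minimal sketch of how empty files change the result, since each file adds `openCostInBytes` to the total even when its length is zero. It does not use Spark: `FileEntry`, the parameter names, and the final min/max clamp are assumptions for illustration. Only the `totalBytes` expression is taken from the line quoted above.

```scala
object MaxSplitBytesSketch {
  // Hypothetical stand-in for a listed file; only the length matters here.
  final case class FileEntry(path: String, length: Long)

  // Assumed shape of the calculation: the totalBytes line mirrors the quoted
  // code, the clamp at the end is an assumption about the surrounding method.
  def maxSplitBytes(
      files: Seq[FileEntry],
      maxPartitionBytes: Long,
      openCostInBytes: Long,
      parallelism: Int): Long = {
    // Every file, including an empty one, contributes its open cost.
    val totalBytes = files.map(_.length + openCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

With one 100-byte file and two empty files, an open cost of 10, and a parallelism of 1, the total is 130 bytes when the empty files are counted and 110 bytes once they are filtered out, so the computed split size differs.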
Test build #104005 has finished for PR 24227 at commit
@MaxGekk Thanks for the suggestion. I have updated the PR.
MaxGekk
left a comment
LGTM
thanks, merging to master!
Test build #104016 has finished for PR 24227 at commit
… on listing files

In apache#23130, all empty files are excluded from target file splits in `FileSourceScanExec`. In File source V2, we should keep the same behavior. This PR suggests to filter out empty files on listing files in `PartitioningAwareFileIndex` so that the upper level doesn't need to handle them.

Unit test

Closes apache#24227 from gengliangwang/ignoreEmptyFile.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?

In #23130, all empty files are excluded from target file splits in `FileSourceScanExec`. In File source V2, we should keep the same behavior.

This PR proposes filtering out empty files when listing files in `PartitioningAwareFileIndex`, so that the upper level doesn't need to handle them.

How was this patch tested?

Unit test
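
As a rough illustration of the idea described above (not the actual `PartitioningAwareFileIndex` code), the filter can be applied once at listing time so that callers never see zero-length files. `ListedFile` and `listNonEmptyFiles` are hypothetical names introduced only for this sketch.

```scala
object EmptyFileFilterSketch {
  // Hypothetical stand-in for a file status returned by a listing.
  final case class ListedFile(path: String, length: Long)

  // Drop zero-length files at listing time, so code further up the stack
  // (for example split planning) does not have to special-case them.
  def listNonEmptyFiles(listed: Seq[ListedFile]): Seq[ListedFile] =
    listed.filter(_.length > 0)
}
```

Filtering at this level keeps the rule in one place instead of repeating the check in each consumer of the listing.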