Skip to content

Conversation

@gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Mar 27, 2019

What changes were proposed in this pull request?

In #23130, all empty files are excluded from target file splits in FileSourceScanExec.
In File source V2, we should keep the same behavior.

This PR suggests to filter out empty files on listing files in PartitioningAwareFileIndex so that the upper level doesn't need to handle them.

How was this patch tested?

Unit test

@gengliangwang
Copy link
Member Author

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the filtered out empty files can impact on calculation of maxSplitBytes:

val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
. Probably, we need to handle the case there or filter empty files earlier.

@SparkQA
Copy link

SparkQA commented Mar 27, 2019

Test build #104005 has finished for PR 24227 at commit 664fe9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-27291][SQL] File source V2: Ignore empty files in load [SPARK-27291][SQL] PartitioningAwareFileIndex: Filter out empty files on listing files Mar 27, 2019
@gengliangwang
Copy link
Member Author

gengliangwang commented Mar 27, 2019

@MaxGekk Thanks for the suggestion. I have updated the PR.
My only concern is that the metrics "numFiles" will be affected. The empty files won't be taken into account. But it should be fine since Spark doesn't read these empty files.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 49b0411 Mar 27, 2019
@SparkQA
Copy link

SparkQA commented Mar 27, 2019

Test build #104016 has finished for PR 24227 at commit 4bf5cec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
… on listing files

In apache#23130, all empty files are excluded from target file splits in `FileSourceScanExec`.
In File source V2, we should keep the same behavior.

This PR suggests to filter out empty files on listing files in `PartitioningAwareFileIndex` so that the upper level doesn't need to handle them.

Unit test

Closes apache#24227 from gengliangwang/ignoreEmptyFile.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants