[SPARK-21435][SQL] Empty files should be skipped while write to file #18654
Conversation
@HyukjinKwon Thanks for your comment. As you mentioned in #18650 and #17395, the empty result for Parquet can be fixed by leaving the first partition; what about the ORC format? Should the ORC error for empty results also be considered in this patch?

I think ORC can be dealt with separately (the problem is within the ORC source, given my past investigation).
val writeTask =
  if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) {
    if (sparkPartitionId != 0 && !iterator.hasNext) {
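For context, a minimal sketch of the full task selection around this hunk after the change. EmptyDirectoryWriteTask is the class this PR adds; the other task names and constructor arguments are assumptions based on the surrounding FileFormatWriter code of the time:

val writeTask =
  if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) {
    if (sparkPartitionId != 0 && !iterator.hasNext) {
      // Non-first empty partitions write nothing at all. Partition 0 still
      // runs even when empty, so file-format metadata (e.g. the Parquet
      // schema and footer) is written for an empty result.
      new EmptyDirectoryWriteTask
    } else {
      new SingleDirectoryWriteTask(description, taskAttemptContext, committer)
    }
  } else {
    new DynamicPartitionWriteTask(description, taskAttemptContext, committer)
  }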
I guess this might be okay, in the sense that, to my knowledge, partition IDs are guaranteed to start from 0. Could you take a look and see if it makes sense to you, cc @cloud-fan, if you don't mind? I am not confident enough to proceed with reviewing and leave a sign-off.
This is a little hacky, but I think it is the simplest fix.
cc @hvanhovell
cc @yhuai too, who reviewed my similar PR before.
class FileFormatWriterSuite extends QueryTest with SharedSQLContext {

  test("empty file should be skipped while write to file") {
    withTempDir { dir =>
withTempPath can be used instead I believe.
+1
      dir.delete()
      spark.range(10000).repartition(10).write.parquet(dir.toString)
      val df = spark.read.parquet(dir.toString)
      val allFiles = dir.listFiles(new FilenameFilter {
Can we just do this more simply? For example:

.listFiles().filter { f =>
  f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_")
}
Either is OK, I think; I just copied this from HadoopFsRelationSuite.
Yea. If both are okay, let's go for the shorter one.
+1 for the shorter one
          !name.startsWith(".") && !name.startsWith("_")
        }
      })
      assert(allFiles.length == 10)
Could I ask what this test targets? I think I am lost around here ...
It just makes sure the source dir has many files while the output dir only has 2 files.
If this confuses people, should I just leave a note and delete the assert?
But I guess this one (the latter) does not test this change? If this test passes regardless of this PR's change, I would rather remove it.
OK, I'll remove this assert and leave a note.
      withTempDir { dst_dir =>
        dst_dir.delete()
        df.where("id = 50").write.parquet(dst_dir.toString)
I would explicitly repartition here.
Why do we need to repartition?
I was thinking just to make sure of the (previous) number of files written out.
I mean, for example, if we happen to have only a few partitions in the df for whatever reason, I guess this test could become invalid ...
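In other words, the suggestion is something like the line below, so that the expected output file count does not depend on how many partitions the planner happens to produce (the repartition(10) value is illustrative):

df.repartition(10).where("id = 50").write.parquet(dst_dir.toString)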
Test build #79667 has finished for PR 18654 at commit

retest this please
What is the metadata we need to write?

The schema and the footer, in the case of Parquet. There is more context here - #17395 (comment). For example, if we don't write out the empty files, it breaks:

spark.range(100).filter("id > 100").write.parquet("/tmp/abc")
spark.read.parquet("/tmp/abc").show()

Yep, an empty result dir needs this metadata; otherwise reading it will throw the exception:
          !name.startsWith(".") && !name.startsWith("_")
        }
      })
      // First partition file and the data file
Ideally we only need the first partition file if all other partitions are empty, but this is hard to do right now.
Couldn't agree more. At first I tried to implement it like that, but FileFormatWriter.write can only see each task's own iterator.
Test build #79687 has finished for PR 18654 at commit
class FileFormatWriterSuite extends QueryTest with SharedSQLContext {

  test("empty file should be skipped while write to file") {
    withTempPath { dir =>
Could we maybe just do as below?

withTempPath { path =>
  spark.range(100).repartition(10).where("id = 50").write.parquet(path.getAbsolutePath)
  val partFiles = path.listFiles()
    .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
  assert(partFiles.length === 2)
}
Clearer :) No need to actually create source files.
Test build #79694 has finished for PR 18654 at commit

Test build #79695 has finished for PR 18654 at commit

retest this please

Test build #79700 has finished for PR 18654 at commit

retest this please

Test build #79704 has finished for PR 18654 at commit

ping @cloud-fan @HyukjinKwon

LGTM, merging to master!
What changes were proposed in this pull request?

Add an EmptyDirectoryWriteTask for empty tasks while writing files. Fix the empty result for the Parquet format by leaving the first partition for metadata writing.

How was this patch tested?

Added a new test in FileFormatWriterSuite.
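A minimal sketch of what the added EmptyDirectoryWriteTask amounts to; the class name is from this PR, while the ExecuteWriteTask signature shown is an assumption about the FileFormatWriter internals of the time:

/** Sketch: a write task for empty partitions that produces no files and updates no partitions. */
private class EmptyDirectoryWriteTask extends ExecuteWriteTask {
  override def execute(iter: Iterator[InternalRow]): Set[String] = {
    // Nothing to consume and nothing to write out.
    Set.empty[String]
  }
  override def releaseResources(): Unit = {}
}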