[SPARK-26161][SQL] Ignore empty files in load #23130
Changes from all commits: 36c64f0, e428b83, 4212add, 2458882, b200a50, e7871f3, 7057f8b, 1f58cc1
sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala:

```diff
@@ -18,6 +18,8 @@
 package org.apache.spark.sql.sources

 import java.io.File
+import java.nio.charset.StandardCharsets
+import java.nio.file.{Files, Paths}

 import org.scalatest.BeforeAndAfter

@@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndAfter
       assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog"))
     }
   }
+
+  test("skip empty files in non bucketed read") {
+    withTempDir { dir =>
+      val path = dir.getCanonicalPath
+      Files.write(Paths.get(path, "empty"), Array.empty[Byte])
+      Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8))
+      val readback = spark.read.option("wholetext", true).text(path)
+
+      assert(readback.rdd.getNumPartitions === 1)
+    }
+  }
 }
```
Review discussion on the new test:

Contributor: does this test fail without your change? IIUC one partition can read multiple files. Is JSON the only data source that may return a row for an empty file?

Member (author): Yes, it does, due to the … We depend on the underlying parser here. I will check CSV and Text.

Contributor: do you mean …

Member (author): I think so:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala, line 57 in 46110a5.
This can guarantee (in text data sources, at least) one file -> one partition.
Do you mean this code?
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala, lines 459 to 464 in 8c68718.

Contributor: thanks for pointing it out, I think we are good here.
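For context on the "one file -> one partition" guarantee: the text source declares whole-text files non-splittable, so a file read with `wholetext` can never be divided across partitions. A minimal, self-contained sketch of that idea; the trait and objects below are hypothetical stand-ins, not the real Spark classes:

```scala
// Hypothetical stand-in for a file format's splittability hook.
trait FileFormatLike {
  def isSplitable(options: Map[String, String], path: String): Boolean = true
}

object WholeTextAwareFormat extends FileFormatLike {
  // With wholetext=true, each file is a single record, so it must not be
  // split; every non-empty file then yields exactly one partition.
  override def isSplitable(options: Map[String, String], path: String): Boolean =
    !options.getOrElse("wholetext", "false").toBoolean &&
      super.isSplitable(options, path)
}

object SplitableDemo extends App {
  // wholetext on: the file may not be split across partitions.
  println(WholeTextAwareFormat.isSplitable(Map("wholetext" -> "true"), "a.txt")) // false
  // wholetext off: normal line-based splitting is allowed.
  println(WholeTextAwareFormat.isSplitable(Map.empty, "a.txt"))                  // true
}
```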
Review discussion on the main change:

Reviewer: do the filtering inside the map?

Reviewer: Do we have a test case for this line?

Author: do you mean changing `filter...map...` to `flatMap`? I don't have a strong preference about it. The updated test cases and the new test case are for this change.

Reviewer: I personally prefer filter + map, as it's shorter and clearer. I don't know if one is faster: two transformations vs. having to return Some/None. For a Dataset operation I'd favor one operation, but this is just local Scala code.

Author: It's a non-critical path in terms of performance. Should be okay.
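To make the two shapes being debated concrete, here is a minimal sketch; the helper names and the use of Hadoop's `FileStatus` are assumptions for illustration, not the exact code in `DataSourceScanExec`:

```scala
import org.apache.hadoop.fs.FileStatus

// Shape in the PR: drop zero-length files first, then transform.
def splitsViaFilterMap(files: Seq[FileStatus]): Seq[String] =
  files.filter(_.getLen > 0).map(_.getPath.toString)

// Suggested alternative: a single flatMap returning Some/None per file.
def splitsViaFlatMap(files: Seq[FileStatus]): Seq[String] =
  files.flatMap { f =>
    if (f.getLen > 0) Some(f.getPath.toString) else None
  }
```

Both return the same result; filter + map reads more directly, while flatMap traverses the collection once. Since this runs on the driver while planning the scan, not per row, either choice is fine, as the thread concludes.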
Reviewer: This `createBucketedReadRDD` is for the bucket table, right?

Author: yes, and the same change is also in `createNonBucketedReadRDD`.
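A rough sketch of where the guard sits in both read paths; `PartitionDirectoryLike` and the callbacks are hypothetical simplifications of the real `createBucketedReadRDD` / `createNonBucketedReadRDD` logic:

```scala
import org.apache.hadoop.fs.FileStatus

// Hypothetical simplification of Spark's PartitionDirectory.
case class PartitionDirectoryLike(files: Seq[FileStatus])

// Bucketed path (sketch): group only non-empty files by bucket id.
def groupFilesByBucket(
    partitions: Seq[PartitionDirectoryLike],
    bucketIdOf: FileStatus => Int): Map[Int, Seq[FileStatus]] =
  partitions.flatMap(_.files.filter(_.getLen > 0)).groupBy(bucketIdOf)

// Non-bucketed path (sketch): split only non-empty files into read tasks.
def planFileSplits(
    partitions: Seq[PartitionDirectoryLike],
    splitFile: FileStatus => Seq[String]): Seq[String] =
  partitions.flatMap(_.files.filter(_.getLen > 0)).flatMap(splitFile)
```

Applying the same zero-length filter in both paths is what makes empty files disappear from the scan, regardless of bucketing.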