[SPARK-28030] [SQL] convert filePath to URI in binary file data source #24855
val content = "123".getBytes
Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
val df = spark.read.format(BINARY_FILE).load(dir.getPath)
df.collect()
Add a check that the "path" in the collect result equals ".../test space.txt"?
WeichenXu123
left a comment
One minor comment.
Thanks!
holdenk
left a comment
Thanks for this proposal, good to support spaces. Do we maybe want to support UTF-8 and/or have tests for fancy names like ☃️.csv?
...test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala
}

test("SPARK-28030: support chars in file names that require URL encoding") {
  withTempDir { dir =>
Is it OK that we only have the space in the file name, or do we need it in the path we're providing to trigger SPARK-28030?
this test will fail without the patch
  }
}

test("SPARK-28030: support chars in file names that require URL encoding") {
The change seems to impact more than just the binary file format, so maybe this belongs in one of our root datasource tests. What do you think?
It still relies on each data source implementation to recognize that filePath is actually a URI. I don't see how to test it in our root datasource test. Btw, this PR adds a regression test, so I want to keep the scope minimal.
 * that need to be prepended to each row.
 *
 * @param partitionValues value of partition columns to be prepended to each row.
 * @param filePath path of the file to read
nit: Maybe add a comment here that we throw an exception if we're passed an invalid URI.
We don't throw an exception in the constructor. It depends on how downstream uses it. Validating that it is a valid URI is beyond the scope of this PR.
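To illustrate the point about validation happening downstream, here is a minimal sketch that uses only `java.net.URI`. `FileRef` and `toUri` are hypothetical stand-ins invented for this example; they are not Spark's `PartitionedFile` or any Spark API. The sketch only shows that a class storing the path as a plain `String` accepts any value, and that a malformed URI surfaces at the point where a consumer parses it.

```scala
import java.net.URI
import scala.util.Try

object LazyUriValidationSketch {
  // Hypothetical stand-in for a class that keeps the path as a plain String.
  // Constructing it never validates the value.
  final case class FileRef(filePath: String)

  // Hypothetical downstream consumer: parsing happens here, so this is
  // where an invalid URI string is detected (as a URISyntaxException).
  def toUri(ref: FileRef): Try[URI] = Try(new URI(ref.filePath))

  def main(args: Array[String]): Unit = {
    val encoded = FileRef("file:/tmp/test%20space.txt")
    val unencoded = FileRef("file:/tmp/test space.txt") // construction succeeds

    // The encoded form parses, and getPath returns the decoded path.
    println(toUri(encoded).map(_.getPath))
    // The unencoded space is an illegal URI character, so parsing fails.
    println(toUri(unencoded).isFailure)
  }
}
```

Whether a failure like this should be reported earlier is the design question raised in the review; the sketch takes no position on it.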
Test build #106428 has finished for PR 24855 at commit
Test build #106431 has finished for PR 24855 at commit
Merged into master.
Ur, can we have an explicit LGTM or
cc: @WeichenXu123
## What changes were proposed in this pull request?

Convert `PartitionedFile.filePath` to URI first in binary file data source. Otherwise Spark will throw a FileNotFound exception because we create `Path` with URL encoded string, instead of wrapping it with URI.

## How was this patch tested?

Unit test.

Closes apache#24855 from mengxr/SPARK-28030.

Authored-by: Xiangrui Meng <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
What changes were proposed in this pull request?
Convert `PartitionedFile.filePath` to URI first in binary file data source. Otherwise Spark will throw a FileNotFound exception because we create `Path` with URL encoded string, instead of wrapping it with URI.
How was this patch tested?
Unit test.
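The failure mode described above can be reproduced without Spark or Hadoop. The following is a minimal sketch, assuming a Unix-like filesystem, that uses `java.nio.file.Paths` as a stand-in for Hadoop's `Path`; the helper names `existsAsLiteral` and `existsViaUri` are hypothetical and exist only for this illustration. It shows that a percent-encoded path string used literally does not locate a file whose name contains a space, while going through `java.net.URI` does.

```scala
import java.net.URI
import java.nio.file.{Files, Paths}

object UriPathSketch {
  // Uses the still percent-encoded path component as if it were a plain
  // file path, analogous to building a path directly from the encoded string.
  def existsAsLiteral(encoded: String): Boolean =
    Files.exists(Paths.get(new URI(encoded).getRawPath))

  // Wraps the string in a URI first, so "%20" is decoded back to a space.
  def existsViaUri(encoded: String): Boolean =
    Files.exists(Paths.get(new URI(encoded)))

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("spark-28030-sketch")
    val file = Files.write(dir.resolve("test space.txt"), "123".getBytes)

    // Path.toUri percent-encodes the space in the file name.
    val encoded = file.toUri.toString

    println(existsAsLiteral(encoded)) // looks for a file literally named "test%20space.txt"
    println(existsViaUri(encoded))    // resolves to the real "test space.txt"
  }
}
```

In the actual patch the conversion is applied to `PartitionedFile.filePath` before a Hadoop `Path` is created; the sketch only mirrors that idea with JDK classes.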