Skip to content

Conversation

@mengxr
Copy link
Contributor

@mengxr mengxr commented Jun 12, 2019

What changes were proposed in this pull request?

Convert PartitionedFile.filePath to URI first in binary file data source. Otherwise Spark will throw a FileNotFound exception because we create Path with URL encoded string, instead of wrapping it with URI.

How was this patch tested?

Unit test.

val content = "123".getBytes
Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
val df = spark.read.format(BINARY_FILE).load(dir.getPath)
df.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a check that the collect result "path" equal to ".../test space.txt" ?

Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment.
Thanks!

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this proposal, good to support spaces. Do we want to maybe support UTF8 and or have tests for supporting fancy things like ☃️.csv?

}

test("SPARK-28030: support chars in file names that require URL encoding") {
withTempDir { dir =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok that we only have the space in the file name, or do we need it in the path were providing to trigger SPARK-28030?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test will fail without the patch

}
}

test("SPARK-28030: support chars in file names that require URL encoding") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems to impact not just the binary file format, maybe this belongs in one of our root datasource tests. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still relies on each data source implementation to recognize filePath is actually an URI. I don't see how to test it in our root datasource test. Btw, this PR adds a regression test. So I want to keep the scope minimal.

* that need to be prepended to each row.
*
* @param partitionValues value of partition columns to be prepended to each row.
* @param filePath path of the file to read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe add a comment here that we throw an exception if were passed an invalid URI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't throw an exception in the constructor. It depends on how downstream uses it. Validating that it is a valid URI is beyond the scope of this PR.

@mengxr mengxr changed the title [SPARK-28030] convert filePath to URI in binary file data source [SPARK-28030] [SQL] convert filePath to URI in binary file data source Jun 12, 2019
@SparkQA
Copy link

SparkQA commented Jun 12, 2019

Test build #106428 has finished for PR 24855 at commit cf96785.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 12, 2019

Test build #106431 has finished for PR 24855 at commit 2e2b43f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor Author

mengxr commented Jun 12, 2019

Merged into master.

@dongjoon-hyun
Copy link
Member

Ur, can we have an explicit LGTM or Looks good to me, @mengxr ?

@asfgit asfgit closed this in 4f4829b Jun 12, 2019
@mengxr
Copy link
Contributor Author

mengxr commented Jun 12, 2019

cc: @WeichenXu123

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
## What changes were proposed in this pull request?

Convert `PartitionedFile.filePath` to URI first in binary file data source. Otherwise Spark will throw a FileNotFound exception because we create `Path` with URL encoded string, instead of wrapping it with URI.

## How was this patch tested?

Unit test.

Closes apache#24855 from mengxr/SPARK-28030.

Authored-by: Xiangrui Meng <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
## What changes were proposed in this pull request?

Convert `PartitionedFile.filePath` to URI first in binary file data source. Otherwise Spark will throw a FileNotFound exception because we create `Path` with URL encoded string, instead of wrapping it with URI.

## How was this patch tested?

Unit test.

Closes apache#24855 from mengxr/SPARK-28030.

Authored-by: Xiangrui Meng <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants