-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12220][Core]Make Utils.fetchFile support files that contain special characters #10208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
CC @vanzin |
|
LGTM pending tests. |
|
Test build #47373 has finished for PR 10208 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Utils.resolveURI already covers this functionality.
(It's "URI" and "URL" not "uri" and "url".)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Utils.resolveURI cannot handle some file names, such as abc:xyz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that file name is a problem in a URL though. : is not reserved in that part. You'd have to send this to java.net.URI "manually" like you've done, or else use file:/abc:xyz. Then it's doing the same thing; your method wouldn't escape the colon either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to encode colon. The purpose here is: assume we have a uri prefix: spark://localhost:1234/ and a file name: abc xyz, and want to concat the prefix and the file name to spark://localhost:1234/abc%20xyz so that we can pass it to java.net.URI(String).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, but the existing method already does just that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I pass abc:xyz to URI.resolveURI, it doesn't work:
scala> Utils.resolveURI("abc:xyz").getRawPath
res0: String = null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it must be file:/abc:xyz since you need to tell it it needs to be treated as a file name. Meh, at this point why not just new URI("file:/" + fileName).getRawPath.substring(1) and keep it simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh, at this point why not just new URI("file:/" + fileName).getRawPath.substring(1) and keep it simple?
The parameter of URI(String) must be an encoded path. But here fileName is what we want to encode.
|
retest this please |
|
Test build #47437 has started for PR 10208 at commit |
|
retest this please |
|
Test build #47529 has finished for PR 10208 at commit
|
|
@srowen further comments? |
|
This feels hacky to be partly encoding URIs with ad-hoc methods. I understand the problem but this seems to be making the project code more complex. The methods seems like overkill since you're just calling a single (tested) JDK method. shrug |
|
@srowen Just found |
|
LGTM, will merge once tests pass. |
|
yeah if it doesn't mean adding a dependency on HttpClient into core -- it may happen to come in transitively already |
|
retest this please |
|
Test build #47675 has finished for PR 10208 at commit
|
|
retest this please |
|
Test build #47683 has finished for PR 10208 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only accidentally works since the library is a transitive dependency of core. I'm hesitant to make it a direct dependency and make core yet bigger, but in theory it already is (indirectly). It's not otherwise used in Spark though.
I know we've been around a lot on this; I had though using the JDK URI class directly was simplest rather than drag in a dependency. If anything I was questioning wrapping it up in a method, but, that's not that bad.
Any other opinions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anything I was questioning wrapping it up in a method, but, that's not that bad.
Do you suggest that reverting my last commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before you consider doing more work to change again, I'd love to get another opinion if possible. We won't stall too long on this in any event. I suppose I preferred your original change, yes, to this one, just because it avoids any new subtle dependency entanglement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider explicitly adding that dependency? If it's already done somewhere else then it would be best that we don't re-implement the same thing and potentially introduce our own bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okey. Explicitly added it.
|
Test build #47767 has finished for PR 10208 at commit
|
|
retest this please |
|
Test build #47786 has finished for PR 10208 at commit
|
|
The problem is, nothing else in Spark uses HttpClient. I don't think it's worth it just for this, especially since there's essentially a one-liner using JDK methods. I agree that, if we could reuse a significant set of methods from HttpClient and retire some Spark code, that'd make it worth it. That's a bit of a separate issue. Still there, I know HttpClient versions can be a bit of a problem since lots of things depend on various versions of 3.x and 4.x, and it's easy to hit version compatibility problems. |
|
I suppose the decision is what we want to maintain more: our own encode/decode logic, or potential dependency conflicts. If we test the encode/decode logic well then we probably don't need to maintain it much, but at the same time it's good to use a trusted library in case there's something we missed. Though I agree the dependency thing could be more difficult to maintain. I'm OK one way or the other, so we can just do it ourselves as @srowen prefers. |
|
Okey. I just removed my last 2 commits. |
|
retest this please |
|
Test build #47847 has finished for PR 10208 at commit
|
|
@srowen do the latest changes LGTY? |
|
OK. I think this pretty fine. |
|
Great. Merging to master and 1.6 |
…pecial characters This PR encodes and decodes the file name to fix the issue. Author: Shixiong Zhu <[email protected]> Closes #10208 from zsxwing/uri. (cherry picked from commit 86e405f) Signed-off-by: Shixiong Zhu <[email protected]>
This PR encodes and decodes the file name to fix the issue.