[SPARK-12220][Core]Make Utils.fetchFile support files that contain special characters #10208
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -331,6 +331,30 @@ private[spark] object Utils extends Logging { | |
| } | ||
|
|
||
| /** | ||
| * A file name may contain some invalid URI characters, such as " ". This method will convert the | ||
| * file name to a raw path accepted by `java.net.URI(String)`. | ||
| * | ||
| * Note: the file name must not contain "/" or "\" | ||
| */ | ||
| def encodeFileNameToURIRawPath(fileName: String): String = { | ||
| require(!fileName.contains("/") && !fileName.contains("\\")) | ||
| // `file` and `localhost` are not used. Just to prevent URI from parsing `fileName` as | ||
| // scheme or host. The prefix "/" is required because URI doesn't accept a relative path. | ||
| // We should remove it after we get the raw path. | ||
| new URI("file", null, "localhost", -1, "/" + fileName, null, null).getRawPath.substring(1) | ||
| } | ||
|
|
||
| /** | ||
| * Get the file name from uri's raw path and decode it. If the raw path of uri ends with "/", | ||
| * return the name before the last "/". | ||
| */ | ||
| def decodeFileNameInURI(uri: URI): String = { | ||
| val rawPath = uri.getRawPath | ||
| val rawFileName = rawPath.split("/").last | ||
| new URI("file:///" + rawFileName).getPath.substring(1) | ||
|
Member
I don't get this function -- the URI would already have methods to return the 'non-raw' path, right?
Member
Author
Here I want to split the raw path before decoding it, so that it can handle URIs that contain
Member
I see -- I think you don't need to go to a URI, back to string, back to URI here. This may not even need a method for the one-liner:
As an aside, some app servers will be picky about serving URIs with `%2F` in some places, since this has been used in the past for some security exploits, to disguise a cheeky request for some local file URL that would otherwise be caught by (faulty) logic in the app that's not thinking about escaped sequences and trying to handle raw paths. I think Tomcat won't serve them, for example. It may be overkill, but it might even be worth conservatively rejecting such a URI anyway.
Member
Author
I created a method here so that I can write unit tests for this one to confirm the desired behavior. For the security issue of
Member
Sure, but isn't that an argument that you don't need this method (or at least you don't need more than
Member
Author
The URI here may contain more than a file name, such as |
||
| } | ||
|
|
||
| /** | ||
| * Download a file or directory to target directory. Supports fetching the file in a variety of | ||
| * ways, including HTTP, Hadoop-compatible filesystems, and files on a standard filesystem, based | ||
| * on the URL parameter. Fetching directories is only supported from Hadoop-compatible | ||
|
|
@@ -351,7 +375,7 @@ private[spark] object Utils extends Logging { | |
| hadoopConf: Configuration, | ||
| timestamp: Long, | ||
| useCache: Boolean) { | ||
| - val fileName = url.split("/").last | ||
| + val fileName = decodeFileNameInURI(new URI(url)) | ||
| val targetFile = new File(targetDir, fileName) | ||
| val fetchCacheEnabled = conf.getBoolean("spark.files.useFetchCache", defaultValue = true) | ||
| if (useCache && fetchCacheEnabled) { | ||
|
|
||
I think `Utils.resolveURI` already covers this functionality. (It's "URI" and "URL", not "uri" and "url".)

`Utils.resolveURI` cannot handle some file names, such as `abc:xyz`.

I don't think that file name is a problem in a URL though. `:` is not reserved in that part. You'd have to send this to `java.net.URI` "manually" like you've done, or else use `file:/abc:xyz`. Then it's doing the same thing; your method wouldn't escape the colon either.

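To make the colon point concrete, here is a minimal sketch of how `java.net.URI` treats the two spellings discussed above. The object name `ColonInUriSketch` is hypothetical and not part of the PR; only standard `java.net.URI` calls are used.

```scala
import java.net.URI

// Hypothetical illustration object; not part of the PR.
object ColonInUriSketch {
  // Without a leading scheme, "abc" is parsed as the scheme and "xyz" as an
  // opaque scheme-specific part, so there is no path component at all.
  val bare = new URI("abc:xyz")

  // With an explicit "file:" scheme and a leading "/", the colon is simply
  // part of the path and is not escaped.
  val asFile = new URI("file:/abc:xyz")
}
```

In this sketch `bare.getPath` is `null` because the URI is opaque, while `asFile.getRawPath` keeps the colon unescaped.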
We don't need to encode the colon. The purpose here is: assume we have a URI prefix, `spark://localhost:1234/`, and a file name, `abc xyz`, and want to concatenate the prefix and the file name into `spark://localhost:1234/abc%20xyz` so that we can pass it to `java.net.URI(String)`.

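The use case described above can be sketched as a round trip. The two helper bodies below are copied from the diff in this PR; the wrapping object `FileNameUriSketch` and the `roundTrip` helper are hypothetical names added only for illustration.

```scala
import java.net.URI

// Hypothetical wrapper object; the two helpers mirror the methods in the diff.
object FileNameUriSketch {
  def encodeFileNameToURIRawPath(fileName: String): String = {
    require(!fileName.contains("/") && !fileName.contains("\\"))
    // The multi-argument constructor quotes illegal characters in the path.
    new URI("file", null, "localhost", -1, "/" + fileName, null, null).getRawPath.substring(1)
  }

  def decodeFileNameInURI(uri: URI): String = {
    // Split the raw (still encoded) path first, then decode the last segment.
    val rawFileName = uri.getRawPath.split("/").last
    new URI("file:///" + rawFileName).getPath.substring(1)
  }

  // Hypothetical helper: build a full URI from a prefix and a plain file name,
  // then recover the file name from it.
  def roundTrip(prefix: String, fileName: String): (URI, String) = {
    val uri = new URI(prefix + encodeFileNameToURIRawPath(fileName))
    (uri, decodeFileNameInURI(uri))
  }
}
```

For example, `FileNameUriSketch.roundTrip("spark://localhost:1234/", "abc xyz")` should yield the URI `spark://localhost:1234/abc%20xyz` together with the decoded name `abc xyz`.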
That's right, but the existing method already does just that.

If I pass `abc:xyz` to URI.resolveURI, it doesn't work:

Yes, it must be `file:/abc:xyz` since you need to tell it that it needs to be treated as a file name. Meh, at this point why not just `new URI("file:/" + fileName).getRawPath.substring(1)` and keep it simple?

The parameter of `URI(String)` must be an encoded path. But here `fileName` is what we want to encode.
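
The difference between the two constructors can be sketched as follows. This is an illustration under the assumption that the file name is a plain, unencoded string; `EncodedPathSketch`, `viaSingleArg`, and `viaMultiArg` are hypothetical names, not part of the PR.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical illustration object; not part of the PR.
object EncodedPathSketch {
  // The single-argument constructor parses its input and expects it to be
  // already encoded, so a raw space makes it throw URISyntaxException.
  def viaSingleArg(fileName: String): Try[String] =
    Try(new URI("file:/" + fileName).getRawPath.substring(1))

  // The multi-argument constructor quotes illegal characters (and always
  // quotes '%'), so it can take the unencoded file name directly.
  def viaMultiArg(fileName: String): String =
    new URI("file", null, "localhost", -1, "/" + fileName, null, null).getRawPath.substring(1)
}
```

With these definitions, `viaSingleArg("abc xyz")` is a `Failure`, while `viaMultiArg("abc xyz")` returns `abc%20xyz`. Both agree on `abc:xyz`, which needs no escaping in the path.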