[SPARK-17577][SparkR][Core] SparkR support add files to Spark job and get by executors #15131
Conversation
Test build #65540 has finished for PR 15131 at commit

Test build #65539 has finished for PR 15131 at commit
It looks like cc @HyukjinKwon who worked on this for

@shivaram Thanks for cc'ing me. I will try to look closely within today.
I just took a look. The problematic code is here, SparkContext.scala#L1429. We should not directly use `new java.net.URI(path)`:

```scala
scala> new java.net.URI("C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\RtmpegI4mr\\hello92023051e13.txt")
java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\Users\appveyor\AppData\Local\Temp\1\RtmpegI4mr\hello92023051e13.txt
  at java.net.URI$Parser.fail(URI.java:2848)
  at java.net.URI$Parser.checkChars(URI.java:3021)
  at java.net.URI$Parser.parse(URI.java:3058)
  at java.net.URI.<init>(URI.java:588)
  ... 33 elided
```

I took a look at the APIs taking `path`. I can open a separate PR, post a PR to @yanboliang's forked repo, or just let you fix this here. Please let me know.
@HyukjinKwon @shivaram Thanks for your comments. I will fix the URI issue in this PR as you suggested.

Hm.. It seems it does not trigger the build because the last change does not include R changes. I manually ran this after checking out your PR in my forked repo - https://ci.appveyor.com/project/HyukjinKwon/spark/build/98-15131-pr
Ah, sorry, I should've checked the code further. It seems `Utils.fetchFile` expects `url` as the first argument, not `path`. How about changing `val uri = new URI(path)` to `val uri = new Path(path).toUri` rather than resembling `addJar`?
I remember we discussed this problem in #14960. If comma-separated multiple paths were allowed in `path`, it would be problematic, but it seems only a single path is allowed.
In this case, it'd be safe to use `val uri = new Path(path).toUri`:

```scala
scala> import org.apache.hadoop.fs.Path

scala> new Path("C:\\a\\b\\c").toUri
res1: java.net.URI = /C:/a/b/c

scala> new Path("C:/a/b/c").toUri
res2: java.net.URI = /C:/a/b/c

scala> new Path("/a/b/c").toUri
res3: java.net.URI = /a/b/c

scala> new Path("file:///a/b/c").toUri
res4: java.net.URI = file:///a/b/c

scala> new Path("http://localhost/a/b/c").toUri
res5: java.net.URI = http://localhost/a/b/c
```
(@yanboliang BTW - I don't mind triggering builds manually. Please feel free to submit more commits for test purposes if you will)
@HyukjinKwon Thanks for your suggestion, will update it soon.
Test build #65564 has finished for PR 15131 at commit

R/pkg/R/context.R (Outdated)

this param is not in below?
Could you explain the goal of this a bit more?
@felixcheung I'd like to let SparkR users call the corresponding function equivalent of

@yanboliang
(FYI, I ran another build - https://ci.appveyor.com/project/HyukjinKwon/spark/build/103-15131-pr)
```scala
val uri = new Path(path).toUri
val schemeCorrectedPath = uri.getScheme match {
  case null | "local" => new File(path).getCanonicalFile.toURI.toString
  case _ => path
```
In `Utils.fetchFile(path, ...)` below, it seems we can't pass `path` as it is because `new URI(path)` is called internally, which fails to parse a Windows path.
Could I please ask to change this to `uri.toString`? It'd work fine as far as I know:

```scala
scala> import java.net.URI

scala> import org.apache.hadoop.fs.Path

scala> val a = new Path("C:\\a\\b\\c").toUri
a: java.net.URI = /C:/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("C:/a/b/c").toUri
a: java.net.URI = /C:/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("/a/b/c").toUri
a: java.net.URI = /a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /a/b/c

scala> val a = new Path("file:///a/b/c").toUri
a: java.net.URI = file:///a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = file:///a/b/c

scala> val a = new Path("http://localhost/a/b/c").toUri
a: java.net.URI = http://localhost/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = http://localhost/a/b/c
```
Just in case, I ran the tests after manually fixing this. Maybe we can wait for the result - https://ci.appveyor.com/project/HyukjinKwon/spark/build/108-pr-15131-path
@yanboliang Yeap, it passes the tests at least.
Updated, thanks!
Test build #65570 has finished for PR 15131 at commit

Cool, I just ran again just in case :) https://ci.appveyor.com/project/HyukjinKwon/spark/build/110-pr-check-15131

Test build #65571 has finished for PR 15131 at commit

Test build #65580 has finished for PR 15131 at commit

Jenkins, test this please.

Test build #65590 has finished for PR 15131 at commit

@shivaram @felixcheung @HyukjinKwon Any thoughts? Thanks!
just a thought - I think names like "spark.getSparkFilesRootDirectory" are a bit verbose and perhaps repetitive. Is "SparkFile" a term? Could this be "spark.getFileDir" and "spark.getFiles" or "spark.list.files" (like this)
```r
#' @rdname spark.getSparkFiles
#' @param fileName The name of the file added through spark.addFile
#' @return the absolute path of a file added through spark.addFile.
#' @examples
```
add @export
```r
#'
#' @rdname spark.getSparkFilesRootDirectory
#' @return the root directory that contains files added through spark.addFile
#' @examples
```
add @export
```r
#' @rdname spark.addFile
#' @param path The path of the file to be added
#' @examples
#'\dontrun{
```
add @export
```diff
 */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
```
should there be some tests we can add for this change?
I do understand your concern @felixcheung. However, IMHO, it'd be okay not to test the Hadoop library within Spark. I will try to find some tests/documentation related to Windows paths in Hadoop and then share them to make sure.
FWIW, this case was verified by one of the committers before for Windows paths. So, it'd be okay.
Also, we could alternatively use `Utils.resolveURI`, which is already tested within Spark. However, this util does not seem to handle the `C:/a/b/c` case (as opposed to `C:\a\b\c`), which we should fix. So I suggested `Path(...).toUri` instead, but if you feel strongly about this, we could use that. I will try to find and share docs and tests for `Path` when I get home, though.
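As a side note on why `C:/a/b/c` trips up scheme-based helpers: the string is a syntactically valid URI whose scheme is the drive letter, so any "this already has a scheme, leave it alone" logic is fooled. A small JVM sketch (plain Java with a hypothetical `parts` helper; this shows only `java.net.URI` behavior, not Spark's `Utils.resolveURI` itself):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DriveLetterScheme {
    // Hypothetical helper: return "scheme|scheme-specific-part" for a
    // parsed string, or an error message if it does not parse.
    static String parts(String s) {
        try {
            URI u = new URI(s);
            return u.getScheme() + "|" + u.getSchemeSpecificPart();
        } catch (URISyntaxException e) {
            return "error: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        // A forward-slash Windows path parses without error, but the drive
        // letter becomes the URI scheme.
        System.out.println(parts("C:/a/b/c"));
        // A genuine file URI keeps its real scheme.
        System.out.println(parts("file:///a/b/c"));
    }
}
```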
I just checked the documentation and tests. It seems Windows paths are being tested [1]. As `toString` [2] is based on the URI instance in `Path`, which `toUri` [3] returns directly, I guess it'd be safe.
I could not find an explicit mention of Windows paths in the documentation [4].
[1] https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestPath.java
[2] https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java#L365-L391
[3] https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java#L291
[4] https://hadoop.apache.org/docs/stable2/api/index.html
I agree with @HyukjinKwon . Thanks!
ok thanks!
```diff
 // SparkFiles API to access files.
-Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
-  hadoopConfiguration, timestamp, useCache = false)
+Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
```
should there be some tests we can add for this change?
In my personal opinion, I thought it's okay (I thought about this for a while) because `Utils.fetchFile` wants the first argument as `url`, not `path`. So, this might be a valid correction without tests because we fixed the argument to what the function initially wants. But no strong opinion.
@felixcheung I totally understand your concern about the naming, but I found we cannot use
Test build #65698 has finished for PR 15131 at commit

Jenkins, test this please.

Test build #65705 has finished for PR 15131 at commit
I see. I think SparkFiles itself isn't really well documented, but it's good to be consistent with Scala and Python, even though we don't have a class here.

I will merge this into master. If anyone has more comments, I can address them in follow-up work. Thanks for your review. @felixcheung @HyukjinKwon @shivaram

Hi @yanboliang, do you mind if I ask the changes in

@HyukjinKwon Sounds good. Do you think only backporting the URI-related change is OK?

Yes, I think so. We might be able to cc @sarutak because I see we were concerned about this change, and it'd be nicer if we have a sign-off from another committer (also I would like to let him know about this). We can also cc him in the backport PR maybe.

Opened branch-2.0 backport PR at #15217. Thanks!
… it work well on Windows

What changes were proposed in this pull request?
Update `SparkContext.addFile` to correct the use of `URI` and `Path`, so it can work well on Windows. This is used for the branch-2.0 backport; more details at #15131.

How was this patch tested?
Backport, checked by AppVeyor.

Author: Yanbo Liang <[email protected]>
Closes #15217 from yanboliang/uri-2.0.
What changes were proposed in this pull request?
Scala/Python users can add files to a Spark job by the submit option `--files` or `SparkContext.addFile()`. Meanwhile, users can get an added file by `SparkFiles.get(filename)`. We should also support this function for SparkR users, since they also have requirements for shared dependency files. For example, SparkR users can download third-party R packages to the driver first, add these files to the Spark job as dependencies by this API, and then each executor can install these packages by `install.packages`.

How was this patch tested?
Added unit tests.