
Conversation

@yanboliang
Contributor

@yanboliang yanboliang commented Sep 17, 2016

What changes were proposed in this pull request?

Scala/Python users can add files to a Spark job via the submit option --files or SparkContext.addFile(), and can then retrieve an added file with SparkFiles.get(filename).
We should support this functionality for SparkR users as well, since they have the same need for shared dependency files. For example, SparkR users can first download third-party R packages to the driver, add those files to the Spark job as dependencies via this API, and then have each executor install the packages with install.packages. A sketch of that workflow is shown below.
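For illustration, a hedged sketch of the intended workflow, assuming the API names proposed in this PR (spark.addFile, spark.getSparkFiles); the package archive name and path are hypothetical:

library(SparkR)
sparkR.session()

# Driver: the third-party package archive is assumed to be downloaded already.
pkg <- "/tmp/mypkg_1.0.tar.gz"            # hypothetical path on the driver
spark.addFile(pkg)                        # ship the archive with the Spark job

# Executors: resolve the shipped archive via SparkFiles and install it locally.
installed <- spark.lapply(seq_len(4), function(i) {
  localPkg <- spark.getSparkFiles("mypkg_1.0.tar.gz")
  install.packages(localPkg, repos = NULL, type = "source")
  requireNamespace("mypkg", quietly = TRUE)   # TRUE if the install succeeded
})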

How was this patch tested?

Added a unit test.
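For reference, a unit test along these lines might look like the following sketch (testthat style, as used by SparkR's test suite; this is illustrative, not necessarily the exact test added):

test_that("add and get file to be distributed with the Spark job", {
  path <- tempfile(pattern = "hello", fileext = ".txt")
  words <- "Hello World!"
  writeLines(words, path)

  spark.addFile(path)                           # register the file with the job
  downloadPath <- spark.getSparkFiles(basename(path))
  expect_equal(readLines(downloadPath), words)  # the fetched copy matches

  unlink(path)
})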

@SparkQA

SparkQA commented Sep 17, 2016

Test build #65540 has finished for PR 15131 at commit 5c49428.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2016

Test build #65539 has finished for PR 15131 at commit d3dd380.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

It looks like addFile isn't working on Windows because we try to convert the Windows file path into a URI, and that fails. Not sure what the fix is in this case.

cc @HyukjinKwon who worked on this for hadoopFile

@HyukjinKwon
Member

@shivaram Thanks for cc'ing me. I will try to take a close look today.

@HyukjinKwon
Member

I just took a look. The problematic code is here, SparkContext.scala#L1429.

We should not use new URI directly with a Windows path, because paths such as C:\a\b\c are not valid URIs, as shown below:

scala> new java.net.URI("C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\RtmpegI4mr\\hello92023051e13.txt")
java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\Users\appveyor\AppData\Local\Temp\1\RtmpegI4mr\hello92023051e13.txt
  at java.net.URI$Parser.fail(URI.java:2848)
  at java.net.URI$Parser.checkChars(URI.java:3021)
  at java.net.URI$Parser.parse(URI.java:3058)
  at java.net.URI.<init>(URI.java:588)
  ... 33 elided

I took a look at the APIs in SparkContext that take a path as an argument, and it seems this one is the only such case. A similar case is handled in SparkContext.scala#L1702-L1707.

I can open a separate PR, post a PR to @yanboliang's forked repo, or just let you fix it here. Please let me know.

@yanboliang
Contributor Author

@HyukjinKwon @shivaram Thanks for your comments. I will fix the URI issue in this PR as you suggested.

@yanboliang yanboliang changed the title [SPARK-17577][SparkR] SparkR support add files to Spark job and get by executors [SPARK-17577][SparkR][Core] SparkR support add files to Spark job and get by executors Sep 18, 2016
@HyukjinKwon
Member

Hm, it seems this does not trigger the build because the last change does not include R changes. I manually ran it after checking out your PR in my forked repo: https://ci.appveyor.com/project/HyukjinKwon/spark/build/98-15131-pr

Member

@HyukjinKwon HyukjinKwon Sep 18, 2016

Ah, sorry, I should've checked the code further. It seems Utils.fetchFile expects a url as its first argument, not a path. How about changing val uri = new URI(path) to val uri = new Path(path).toUri rather than mirroring addJar?

I remember we discussed this problem in #14960. If comma-separated multiple paths were allowed in path, this would be problematic, but it seems only a single path is allowed here.

In this case, it'd be safe to use val uri = new Path(path).toUri.

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> new Path("C:\\a\\b\\c").toUri
res1: java.net.URI = /C:/a/b/c

scala> new Path("C:/a/b/c").toUri
res2: java.net.URI = /C:/a/b/c

scala> new Path("/a/b/c").toUri
res3: java.net.URI = /a/b/c

scala> new Path("file:///a/b/c").toUri
res4: java.net.URI = file:///a/b/c

scala> new Path("http://localhost/a/b/c").toUri
res5: java.net.URI = http://localhost/a/b/c

Member

(@yanboliang BTW, I don't mind triggering builds manually. Please feel free to submit more commits for test purposes if you'd like.)

Contributor Author

@HyukjinKwon Thanks for your suggestion, will update it soon.

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65564 has finished for PR 15131 at commit 42a17e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

this param is not documented below?

@felixcheung
Member

Could you explain the goal of this a bit more?
It looks like:

  1. these are not exported functions from the R package
  2. these are not documented (@noRd) in the generated API doc
  3. generally in R we avoid naming methods with . because of the S3 method naming convention
  4. we are also deprecating calls to sparkR.init and removing the requirement of passing sc as a parameter

@yanboliang
Contributor Author

@felixcheung I'd like SparkR users to be able to call functions equivalent to SparkContext.addFile, SparkFiles.get, and SparkFiles.getRootDirectory, so they can add their own files to a Spark job and get/read them from the executors. I'm following other functions such as setCheckpointDir in context.R to implement these. Would you mind giving me some suggestions on better naming or other issues? Thanks.

@felixcheung
Member

@yanboliang
Most methods in context.R are not exported (except spark.lapply and setLogLevel) and therefore are not publicly accessible from the R package.
Does this work with an existing SparkContext/SparkSession? If yes, one approach would be to follow spark.lapply or setLogLevel in getting the current SparkContext; see the sketch below.
(Yes, I think the current approach with public methods named sparkR.* and spark.* is a bit messy and has the S3-convention problem mentioned above; generally we try to use sparkR.* for initialization and spark.* for Spark-specific, non-initialization functionality.)
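For illustration, a minimal sketch of that pattern, modeled on setLogLevel (getSparkContext and callJMethod are SparkR internals; the body shown is an assumption, not necessarily this PR's final code):

# Sketch: resolve the active SparkContext internally instead of taking sc as a parameter.
spark.addFile <- function(path) {
  sc <- getSparkContext()   # SparkR internal helper for the current context
  # Delegate to the JVM-side SparkContext.addFile through the R backend.
  invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path))))
}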

@HyukjinKwon
Member

(FYI, I ran another build - https://ci.appveyor.com/project/HyukjinKwon/spark/build/103-15131-pr)

val uri = new Path(path).toUri
val schemeCorrectedPath = uri.getScheme match {
  case null | "local" => new File(path).getCanonicalFile.toURI.toString
  case _ => path
Member

@HyukjinKwon HyukjinKwon Sep 18, 2016

In Utils.fetchFile(path, ...) below, it seems we can't pass path as-is, because new URI(path) is called internally, which fails to parse in the case of a Windows path.

Could I ask you to change this to uri.toString? It'd work fine as far as I know.

import java.net.URI
import org.apache.hadoop.fs.Path

scala> val a = new Path("C:\\a\\b\\c").toUri
a: java.net.URI = /C:/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("C:/a/b/c").toUri
a: java.net.URI = /C:/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("/a/b/c").toUri
a: java.net.URI = /a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /a/b/c

scala> val a = new Path("file:///a/b/c").toUri
a: java.net.URI = file:///a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = file:///a/b/c

scala> val a = new Path("http://localhost/a/b/c").toUri
a: java.net.URI = http://localhost/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = http://localhost/a/b/c

Member

Just in case, I ran the tests after manually fixing this. Maybe we can wait for the result - https://ci.appveyor.com/project/HyukjinKwon/spark/build/108-pr-15131-path

Member

@yanboliang Yeap, it passes the tests at least.

Contributor Author

Updated, thanks!

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65570 has finished for PR 15131 at commit 542b981.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Cool, I ran it again just in case :) https://ci.appveyor.com/project/HyukjinKwon/spark/build/110-pr-check-15131

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65571 has finished for PR 15131 at commit fa82b3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65580 has finished for PR 15131 at commit acfbd8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Sep 19, 2016

Test build #65590 has finished for PR 15131 at commit acfbd8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

@shivaram @felixcheung @HyukjinKwon Any thoughts? Thanks!

@felixcheung
Member

Just a thought: I think names like "spark.getSparkFilesRootDirectory" are a bit verbose and perhaps repetitive. Is "SparkFile" a term? Could this be "spark.getFileDir" and "spark.getFiles", or "spark.list.files" (like this)?

#' @rdname spark.getSparkFiles
#' @param fileName The name of the file added through spark.addFile
#' @return the absolute path of a file added through spark.addFile.
#' @examples
Member

add @export

#'
#' @rdname spark.getSparkFilesRootDirectory
#' @return the root directory that contains files added through spark.addFile
#' @examples
Member

add @export

#' @rdname spark.addFile
#' @param path The path of the file to be added
#' @examples
#'\dontrun{
Member

add @export

 */
def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
Member

should there be some tests we can add for this change?

Member

@HyukjinKwon HyukjinKwon Sep 20, 2016

I do understand your concern, @felixcheung. However, IMHO, it'd be okay not to test the Hadoop library within Spark. I will try to find some tests/documentation related to Windows paths in Hadoop and share them to make sure.

FWIW, this case was previously verified for Windows paths by one of the committers. So it'd be okay.

Member

@HyukjinKwon HyukjinKwon Sep 20, 2016

Alternatively, we could use Utils.resolveURI, which is already tested within Spark. However, that util does not seem to handle the C:/a/b/c case (as opposed to C:\a\b\c), which we should fix. So I suggested Path(...).toUri instead, but if you feel strongly about this, we could use that. I will try to find and share docs and tests for Path when I get home, though.

Member

@HyukjinKwon HyukjinKwon Sep 21, 2016

Contributor Author

I agree with @HyukjinKwon. Thanks!

Member

ok thanks!

 // SparkFiles API to access files.
-Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
-  hadoopConfiguration, timestamp, useCache = false)
+Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
Member

should there be some tests we can add for this change?

Member

In my personal opinion, I thought it's okay (I thought about this for a while) because Utils.fetchFile expects its first argument to be a url, not a path. So this might be a valid correction without tests, since we fixed the argument to match what the function originally expects. But no strong opinion.

@yanboliang
Contributor Author

@felixcheung I totally understand your concern about the naming, but I found we cannot use spark.getFileDir and spark.getFiles. SparkFiles is a term specifically used to describe the files added by addFile, which can be shared between the driver and executors. Spark has other kinds of files or directories that users can get, so I don't think we can simplify the naming. You can refer to the definition of SparkFiles. Thanks!

@SparkQA

SparkQA commented Sep 21, 2016

Test build #65698 has finished for PR 15131 at commit 9ed3c68.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Sep 21, 2016

Test build #65705 has finished for PR 15131 at commit 9ed3c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I see. I think SparkFiles itself isn't really well documented but it's good to be consistent with Scala and Python, even though we don't have a class here.
I'm ok with this.

@yanboliang
Contributor Author

I will merge this into master. If anyone has more comments, I can address them at follow up work. Thanks for your review. @felixcheung @HyukjinKwon @shivaram

@asfgit asfgit closed this in c133907 Sep 22, 2016
@yanboliang yanboliang deleted the spark-17577 branch September 22, 2016 03:12
@HyukjinKwon
Member

Hi @yanboliang, do you mind if I ask whether the changes in SparkContext are something we should backport?

@yanboliang
Contributor Author

@HyukjinKwon Sounds good. Do you think backporting only the URI-related change is OK?

@HyukjinKwon
Member

Yes, I think so. We might cc @sarutak, because I see we were concerned about this change and it'd be nicer to have a sign-off from another committer (also, I would like to let him know about this). We can also cc him in the backport PR, maybe.

@yanboliang
Contributor Author

yanboliang commented Sep 23, 2016

Opened branch-2.0 backport PR at #15217. Thanks!

asfgit pushed a commit that referenced this pull request Sep 23, 2016
… it work well on Windows

## What changes were proposed in this pull request?
Update ```SparkContext.addFile``` to correct the use of ```URI``` and ```Path``` so that it works well on Windows. This is the branch-2.0 backport; more details at #15131.

## How was this patch tested?
Backport, checked by AppVeyor.

Author: Yanbo Liang <[email protected]>

Closes #15217 from yanboliang/uri-2.0.
