[SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio #14784
Conversation
Test build #64339 has finished for PR 14784 at commit
Test build #64340 has finished for PR 14784 at commit
@shivaram @felixcheung Could you help review it? Thanks
R/pkg/R/sparkR.R
Outdated
as you can see on L240, "master" is passed to createSparkContext. If this is not being set as spark.master, could you track down what's going on?
It is passed to createSparkContext, but that is too late: it needs to be passed to the JVM when the JVM is started, otherwise sparkr.zip is not added to args.archives, as the code above shows. Another approach would be to duplicate the SparkSubmit.scala logic in createSparkContext.
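To illustrate the timing problem (a hypothetical example, with made-up values): SparkR launches the JVM through spark-submit, and only options that exist at that moment, for instance via the SPARKR_SUBMIT_ARGS environment variable, can influence the archive logic shown above.
```
# Hypothetical illustration of the timing issue (values are examples only):
# spark-submit, and therefore the sparkr.zip archive logic, only sees options
# that are present when the JVM backend is launched.
Sys.setenv(SPARKR_SUBMIT_ARGS = "--master yarn-client sparkr-shell")
# A master passed later via createSparkContext() only updates the SparkConf;
# by then SparkSubmit has already decided which archives to ship.
```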
OK, I see. So starting programmatically with master set to a cluster manager has been broken from early on, then.
@shivaram I think we need to port this fix to 2.0.1?
Also, I'd move this:
```
if (nzchar(master)) {
  assign("spark.master", master, envir = sparkConfigMap)
}
```
to above L357, and change it to:
```
if (nzchar(master)) {
  sparkConfigMap[["spark.master"]] <- master
}
```
This is because if the user has explicitly set sparkR.session(spark.master = "yarn-client"), it should override whatever master is (that's in L361).
@felixcheung Sorry, I don't get what you mean. L357 will override whatever master is; that's why I make the change after it, so that we can pass the correct master to sparkConfigMap.
Right, I think that could go either way:
```
scala> val a = SparkSession.builder().master("yarn").config("spark.master", "local[3]").getOrCreate()

scala> a.conf.get("master")
java.util.NoSuchElementException: master

scala> a.conf.get("spark.master")
res8: String = local[3]
```
I think the current R implementation mimics the programmatic behavior rather than the command line. We could certainly change it; I just want to highlight that it is a breaking change if we silently flip the precedence order, and I think we should avoid that unless we have a reason to.
That just seems like an ordering issue? For example:
```
> val a = SparkSession.builder().config("spark.master", "local[3]").master("yarn").getOrCreate()
> a.conf.get("spark.master")
res1: String = yarn
```
It is. We are turning the Scala Builder syntax of multiple function calls into one call in R; we just picked the order that is most prominently featured, which is:
```
SparkSession.builder().master("yarn").config("spark.master", "local[3]")
```
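In SparkR terms, a hypothetical call corresponding to that builder chain (assuming the precedence described here, where the explicit master argument wins) would be:
```
# Hypothetical SparkR equivalent of the builder chain above: under the
# precedence being described, the master argument takes effect over the
# spark.master named parameter.
sparkR.session(master = "yarn", spark.master = "local[3]")
```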
I think this is not a critical issue, as it is very rare in practice. We can skip it for now. And I think it would be better to handle it on the Scala side (maybe throw an exception in this scenario).
@shivaram @felixcheung Any more comments?
Test build #64566 has finished for PR 14784 at commit
@zjffdu Sorry for the delay. I think the change looks pretty good. We could just add a
@shivaram @zjffdu My preference would be to keep the existing precedence order, per our discussion above. If we think we should change it at this point, then we should handle everything consistently, in that other explicit parameters should also take precedence this way; this includes master, appName, and possibly previously unmapped parameters like sparkHome, enableHiveSupport, sparkJars (== spark.jars), and sparkPackages (== spark.jars.packages?). And we would need to update the roxygen2 docs for the change.
@shivaram @felixcheung Sorry for the late response. I just rebased the PR and also made spark.master take precedence over master. Please help review.
Test build #65589 has finished for PR 14784 at commit
R/pkg/R/sparkR.R
Outdated
I'm a bit confused. To be concrete, I think we were talking about something like:
```
sparkConfigMap <- convertNamedListToEnv(sparkConfig)
namedParams <- list(...)
if (length(namedParams) > 0) {
  paramMap <- convertNamedListToEnv(namedParams)
  # Override for certain named parameters
  if (exists("spark.master", envir = paramMap)) {
    master <- paramMap[["spark.master"]]
  }
  if (exists("spark.app.name", envir = paramMap)) {
    appName <- paramMap[["spark.app.name"]]
  }
  overrideEnvs(sparkConfigMap, paramMap)
}
if (nzchar(master)) {
  sparkConfigMap[["spark.master"]] <- master
}
```
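A self-contained sketch of the precedence this snippet implements, using plain R and a hypothetical function name (no Spark needed), might look like:
```
# Standalone mock (hypothetical, not SparkR code) of the override logic above:
# an explicit spark.master named parameter wins over the master argument, and
# the winner ends up in the config list.
resolveSparkMaster <- function(master = "", sparkConfig = list(), ...) {
  namedParams <- list(...)
  if (!is.null(namedParams[["spark.master"]])) {
    master <- namedParams[["spark.master"]]
  }
  sparkConfig <- modifyList(sparkConfig, namedParams)
  if (nzchar(master)) {
    sparkConfig[["spark.master"]] <- master
  }
  sparkConfig
}

# Example: spark.master = "yarn" overrides master = "local[2]".
str(resolveSparkMaster(master = "local[2]", spark.master = "yarn"))
```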
R/pkg/R/sparkR.R
Outdated
I think this is from outdated code. Could you check your merge with master, or edit this manually?
Test build #65626 has finished for PR 14784 at commit
LGTM. Thanks for finding and fixing this problem.
merged this to master and branch-2.0 |
… running sparkr in RStudio
## What changes were proposed in this pull request?
Spark will add sparkr.zip to the archives only when it is in YARN mode (SparkSubmit.scala):
```
if (args.isR && clusterManager == YARN) {
  val sparkRPackagePath = RUtils.localSparkRPackagePath
  if (sparkRPackagePath.isEmpty) {
    printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
  }
  val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
  if (!sparkRPackageFile.exists()) {
    printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
  }
  val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
  // Distribute the SparkR package.
  // Assigns a symbol link name "sparkr" to the shipped package.
  args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
  // Distribute the R package archive containing all the built R packages.
  if (!RUtils.rPackages.isEmpty) {
    val rPackageFile =
      RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
    if (!rPackageFile.exists()) {
      printErrorAndExit("Failed to zip all the built R packages.")
    }
    val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
    // Assigns a symbol link name "rpkg" to the shipped package.
    args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
  }
}
```
So it is necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.
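As a rough sketch of that idea (a hypothetical helper name, not the actual patch), the R side could translate these launch-time settings into spark-submit options before the JVM backend is started, so that SparkSubmit sees them and ships sparkr.zip:
```
# Minimal sketch, assuming a hypothetical helper: turn config entries that must
# be known at JVM launch time into spark-submit options prepended to the
# "sparkr-shell" arguments.
buildClientModeSubmitOpts <- function(submitOps = "sparkr-shell", sparkEnvir = list()) {
  launchTimeConfs <- c("spark.master", "spark.yarn.keytab", "spark.yarn.principal")
  for (conf in launchTimeConfs) {
    value <- sparkEnvir[[conf]]
    if (!is.null(value)) {
      submitOps <- paste0("--conf ", conf, "=", value, " ", submitOps)
    }
  }
  submitOps
}

# Example: returns "--conf spark.master=yarn-client sparkr-shell"
buildClientModeSubmitOpts(sparkEnvir = list(spark.master = "yarn-client"))
```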
## How was this patch tested?
Verified manually in RStudio using the following code.
```
Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
df <- as.DataFrame(mtcars)
head(df)
```
…
Author: Jeff Zhang <[email protected]>
Closes #14784 from zjffdu/SPARK-17210.
(cherry picked from commit f62ddc5)
Signed-off-by: Felix Cheung <[email protected]>