Conversation

@sun-rui
Contributor

@sun-rui sun-rui commented Jun 10, 2015

This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node.

This PR uses the JDK jar tool to package the SparkR package, because jar should be available on both Linux and Windows wherever the JDK is installed.

This PR does not address the R worker involved in the RDD API; that will be addressed in a separate JIRA issue.

This PR does not address SBT build. SparkR installation and packaging by SBT will be addressed in a separate JIRA issue.

R/install-dev.bat is not tested. @shivaram, could you help test it?

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34593 has finished for PR 6743 at commit 528f30e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 10, 2015

Why is this needed if the only thing it is using is the DataFrame API?

@rxin
Contributor

rxin commented Jun 10, 2015

Ah, I see. This is only for the AM.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34663 has finished for PR 6743 at commit 8d2a8df.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Contributor

Could you add a comment here as to why we have this '#sparkr'? I believe this is to get the archive to unzip to a symlink named sparkr?

@sun-rui
Contributor Author

sun-rui commented Jun 12, 2015

Yes, this assigns a symbolic link name, so we can refer to the shipped package via the logical name instead of the specific archive file name.
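
For illustration, here is a minimal Scala sketch (the URI value is hypothetical) of how the # fragment on a distributed-cache URI names the symlink that YARN creates for the localized archive:

```scala
import java.net.URI

object ArchiveFragment {
  def main(args: Array[String]): Unit = {
    // YARN localizes the archive into the container's working directory and
    // links it under the name given by the URI fragment, so the job can refer
    // to ./sparkr no matter what the archive file itself is called.
    val uri = new URI("sparkr.zip#sparkr") // hypothetical archive URI
    val linkName = Option(uri.getFragment)
      .getOrElse(new java.io.File(uri.getPath).getName)
    println(s"archive ${uri.getPath} is visible as ./$linkName")
  }
}
```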

@shivaram
Contributor

Thanks @sun-rui -- I didn't get a chance to test this on Windows (or on a YARN cluster) yet. I will try to do it this weekend.

Also, it looks like there is a merge conflict. Could you resolve that?

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34779 has finished for PR 6743 at commit 77055af.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor Author

sun-rui commented Jun 13, 2015

rebased

@SparkQA

SparkQA commented Jun 13, 2015

Test build #34834 has finished for PR 6743 at commit c8ed3d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@sun-rui -- this should be jar.exe instead of jar. The other thing is that jar.exe is only available in the JDK and not in the JRE, so sometimes it may not be on the PATH. There are a couple of options for things we can do here:

  1. We can use %JAVA_HOME%\bin\jar.exe -- this might be safer, as users need to set JAVA_HOME for the compilation to work correctly.
  2. Rtools [1] by default installs a zip utility [2] as zip.exe. At least on my machine, running zip.exe -r sparkr.zip SparkR seems to work.

[1] http://cran.r-project.org/bin/windows/Rtools/
[2] http://www.info-zip.org/
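
As a rough illustration of option 1 -- a sketch under the assumption that JAVA_HOME is set, not the actual install-dev.bat change -- resolving the jar tool under JAVA_HOME instead of relying on the PATH could look like this:

```scala
import java.io.File
import scala.sys.process._

object PackageSparkR {
  def main(args: Array[String]): Unit = {
    // jar ships with the JDK (not the JRE), so resolve it under JAVA_HOME
    // instead of assuming it is on the PATH.
    val javaHome = sys.env.getOrElse("JAVA_HOME", sys.error("JAVA_HOME must be set"))
    val isWindows = sys.props("os.name").toLowerCase.contains("windows")
    val jarTool = new File(new File(javaHome, "bin"), if (isWindows) "jar.exe" else "jar")
    // "cf" creates an archive with the given file name from the SparkR directory.
    val exitCode = Seq(jarTool.getAbsolutePath, "cf", "sparkr.zip", "SparkR").!
    sys.exit(exitCode)
  }
}
```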

@SparkQA

SparkQA commented Jun 14, 2015

Test build #34886 has finished for PR 6743 at commit 2a8f7e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 14, 2015

Test build #34887 has finished for PR 6743 at commit e57c0f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@sun-rui Thanks for the update. I just tested this on a YARN cluster and things seem to work correctly for the use case where we create a data frame from a file (i.e., read.df).

However, the createDataFrame code path still requires the RRDD to pick up worker.R from the zip file. Do we have a separate JIRA for that?

cc @davies

@sun-rui
Contributor Author

sun-rui commented Jun 15, 2015

I originally planned to have a separate JIRA issue for adding shipping of the SparkR package for the RDD APIs. But if this is still required by the DataFrame API, I can do it in this PR.
I just wonder why all tests pass. Are there not enough test cases for createDataFrame()?

@shivaram
Contributor

The tests pass because the createDataFrame call only fails in YARN cluster mode and we don't have tests for SparkR in the YarnSuite (actually this deserves a new JIRA).

BTW, the change needed for createDataFrame to work should be minimal -- I think we can just set the variable SPARKR_PACKAGE_DIR for all modes and then set the default value of sparkRLibDir [1] to Sys.getenv("SPARKR_PACKAGE_DIR").

[1] https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L103
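
A minimal sketch of that fallback (the helper and its names are hypothetical, mirroring the variable names in the comment above):

```scala
object SparkRPackageDir {
  // Fall back to SPARKR_PACKAGE_DIR, which the launcher would export in
  // every deploy mode, whenever no explicit library directory is supplied.
  def resolve(explicitDir: Option[String]): String =
    explicitDir
      .orElse(sys.env.get("SPARKR_PACKAGE_DIR"))
      .getOrElse(sys.error("SparkR package location is not set"))

  def main(args: Array[String]): Unit =
    println(resolve(args.headOption))
}
```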

@tgravescs
Contributor

I'm curious: have you looked at ways of shipping R itself with the job, or are you relying on R being installed on all the YARN nodes? It should be possible with the distributed cache; just wondering if anyone has done it or looked at making it automatic.

@SparkQA

SparkQA commented Jun 29, 2015

Test build #35989 has finished for PR 6743 at commit 0925e2a.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor Author

sun-rui commented Jun 29, 2015

@tgravescs, I think the problem with shipping R itself is that the R executable is platform-specific. Also, it may require OS-specific installation before running R (not sure). pySpark also does not ship Python itself.

@SparkQA

SparkQA commented Jun 29, 2015

Test build #35990 has finished for PR 6743 at commit 86b3fa9.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor Author

sun-rui commented Jun 29, 2015

Added support for shipping the SparkR package for the R workers required by the RDD APIs. Tested createDataFrame() by creating a DataFrame from an R list.

Removed the sparkRLibDir parameter of sparkR.init(). The SparkR package location is now determined on each worker node according to the deployment mode (this allows a node-specific SPARK_HOME); see the sketch below. Not sure if there is a better solution. A rough scan of the pySpark code does not tell me how pySpark locates pyspark.zip in the various deployment modes. @davies, could you help give me a hint and review this patch?

Next, I'd like to refactor this code to align with SPARK-5479 (which moves YARN-specific code from SparkSubmit to deploy/yarn). @shivaram, do you think I should refactor the code in this patch or do it in a new JIRA?
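
For illustration, a hypothetical sketch of the per-mode lookup described above (not the actual patch code; "sparkr" is the symlink name given by the "#sparkr" fragment discussed earlier):

```scala
object SparkRLocation {
  // In YARN cluster mode the package is reached through the "sparkr" symlink
  // created for the localized archive; in other modes it lives in the
  // node-local installation under SPARK_HOME.
  def sparkRPackagePath(master: String, deployMode: String): String =
    if (master.startsWith("yarn") && deployMode == "cluster") "sparkr"
    else sys.env.getOrElse("SPARK_HOME", ".") + "/R/lib"

  def main(args: Array[String]): Unit =
    println(sparkRPackagePath("yarn", "cluster"))
}
```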

@sun-rui
Contributor Author

sun-rui commented Jun 29, 2015

rebased

@SparkQA

SparkQA commented Jun 29, 2015

Test build #35994 has finished for PR 6743 at commit c6ef550.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

Thanks @sun-rui for the update. @davies can probably confirm, but AFAIK the PySpark location is picked up at [1] by looking at the JAR file path / Spark home.

I'll take a closer look at the code later today, but in terms of refactoring I'd say let's do it in a separate JIRA.

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
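
Roughly, the idea at [1] -- a hedged paraphrase, not the exact code; the py4j archive name below is illustrative -- is to derive the search path from SPARK_HOME:

```scala
import java.io.File

object PySparkPath {
  // Build the PySpark search path from SPARK_HOME; empty if it is unset.
  def sparkPythonPath: String =
    sys.env.get("SPARK_HOME").toSeq.flatMap { home =>
      Seq(
        Seq(home, "python").mkString(File.separator),
        Seq(home, "python", "lib", "py4j-src.zip").mkString(File.separator)
      )
    }.mkString(File.pathSeparator)

  def main(args: Array[String]): Unit = println(sparkPythonPath)
}
```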

@davies
Contributor

davies commented Jun 29, 2015

@sun-rui Unfortunately, I also don't know much about how PySpark runs on YARN. Could you add some unit tests for YARN mode? (Just follow the Python ones.)

@sun-rui
Contributor Author

sun-rui commented Jun 30, 2015

@shivaram, yes, I saw that function, but found it confusing that it does not consider the YARN mode case.
@davies, it seems that unit tests for pySpark in YARN modes were added in https://github.com/apache/spark/pull/6360/files. Need confirmation.

@shivaram
Contributor

cc @andrewor14, who probably knows something about the YARN tests

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36801 has finished for PR 6743 at commit c644746.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Jul 8, 2015

Jenkins, retest this please

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36804 has finished for PR 6743 at commit c644746.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@sun-rui could you rebase this? There are conflicts now. @shivaram does the R side look good to you? If so, feel free to merge it.

@shivaram
Contributor

Yeah, the R code looks fine to me. However, since this changes some fundamental things about how we locate the package, @sun-rui it would be great if you could confirm that the following scenarios work correctly (with just a local master or a standalone master is fine):

  1. Using the shell by running bin/sparkR
  2. Running a script using bin/spark-submit or bin/sparkR
  3. Launching SparkR from RStudio

If things look good, let's merge this once it's rebased and tested. We can do more testing while it's in the tree before 1.5.

@SparkQA

SparkQA commented Jul 13, 2015

Test build #37127 has finished for PR 6743 at commit ca63c86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor Author

sun-rui commented Jul 13, 2015

@shivaram, tests done. Also tested YARN cluster, yarn-client, and standalone modes, plus createDataFrame() in yarn-client mode.

@shivaram
Contributor

Thanks @sun-rui - LGTM. Merging this

sun-rui pushed a commit that referenced this pull request Jul 13, 2015
This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node.

This PR uses the JDK jar tool to package the SparkR package, because jar should be available on both Linux and Windows wherever the JDK is installed.

This PR does not address the R worker involved in the RDD API; that will be addressed in a separate JIRA issue.

This PR does not address SBT build. SparkR installation and packaging by SBT will be addressed in a separate JIRA issue.

R/install-dev.bat is not tested. shivaram, could you help test it?

Author: Sun Rui <[email protected]>

Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits:

ca63c86 [Sun Rui] Adjust MimaExcludes after rebase.
7313374 [Sun Rui] Fix unit test errors.
72695fb [Sun Rui] Fix unit test failures.
193882f [Sun Rui] Fix Mima test error.
fe25a33 [Sun Rui] Fix Mima test error.
35ecfa3 [Sun Rui] Fix comments.
c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes.
b05340c [Sun Rui] Fix scala style.
2ca5048 [Sun Rui] Fix comments.
1acefd1 [Sun Rui] Fix scala style.
0aa1e97 [Sun Rui] Fix scala style.
41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs.
49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat.
7b916c5 [Sun Rui] Use 'rem' consistently.
3bed438 [Sun Rui] Add a comment.
681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes.
cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.
@shivaram
Contributor

@sun-rui Could you close this PR? It looks like GitHub PRs are not being closed due to an infrastructure issue: https://issues.apache.org/jira/browse/INFRA-9988

@sun-rui
Contributor Author

sun-rui commented Jul 14, 2015

@shivaram, closing the PR.

@sun-rui sun-rui closed this Jul 14, 2015