[SPARK-6797][SPARKR] Add support for YARN cluster mode. #6743
Conversation
|
Test build #34593 has finished for PR 6743 at commit
|
|
Why is this needed if the only thing it is using is the DataFrame API? |
|
Ah, I see. This is only for the AM. |
|
Test build #34663 has finished for PR 6743 at commit
|
Could you add a comment here as to why we have this '#sparkr'? I believe this is to get the archive to unzip to a symlink named sparkr?
|
Yes, this assigns a symbolic link name. Thus we can refer to the shipped package via the logical name instead of the specific archive file name. |
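A tiny R illustration of this `#` naming convention (the archive URI below is a made-up example, not taken from this PR):

```r
# Illustration only: a "#" fragment on a distributed-cache archive entry
# supplies the logical link name under which the unpacked archive is
# exposed on each node (example URI assumed).
archive   <- "hdfs:///tmp/sparkr-bin.zip#sparkr"
link_name <- sub(".*#", "", archive)
stopifnot(identical(link_name, "sparkr"))
```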
|
Thanks @sun-rui -- I didn't get a chance to test this on Windows (or on a YARN cluster) yet. Will try to do it this weekend. Also, it looks like there is a merge conflict. Could you resolve that? |
|
Test build #34779 has finished for PR 6743 at commit
|
|
rebased |
|
Test build #34834 has finished for PR 6743 at commit
|
R/install-dev.bat
@sun-rui -- this should be jar.exe instead of jar. The other issue is that jar.exe ships only with the JDK and not with the JRE, so it may not always be on the PATH. There are a couple of options here (a sketch of both follows the links below):
- We can use %JAVA_HOME%\bin\jar.exe -- this might be safer, as users need to set JAVA_HOME for the compilation to work correctly anyway.
- Rtools [1] by default installs a zip utility [2] as zip.exe. At least on my machine, running zip.exe -r sparkr.zip SparkR seems to work.

[1] http://cran.r-project.org/bin/windows/Rtools/
[2] http://www.info-zip.org/
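A hedged R sketch of the two options above (illustration only; the real script is a Windows batch file, and the tool paths are assumptions):

```r
# Package the SparkR lib directory on Windows, preferring the JDK jar
# tool and falling back to Rtools' Info-ZIP zip. Paths are assumptions.
lib_dir <- file.path(Sys.getenv("SPARK_HOME"), "R", "lib")
jar_exe <- file.path(Sys.getenv("JAVA_HOME"), "bin", "jar.exe")
old_wd  <- setwd(lib_dir)
if (file.exists(jar_exe)) {
  system2(jar_exe, c("cMf", "sparkr.zip", "SparkR"))  # JDK jar, no manifest
} else {
  system2("zip", c("-r", "sparkr.zip", "SparkR"))     # Rtools zip.exe
}
setwd(old_wd)
```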
|
Test build #34886 has finished for PR 6743 at commit
|
|
Test build #34887 has finished for PR 6743 at commit
|
|
@sun-rui Thanks for the update. I just tested this on a YARN cluster and things seem to work correctly for the use case where we create a DataFrame from a file (i.e., read.df). However, the … cc @davies |
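For reference, a minimal SparkR sketch of the scenario that was tested (1.4-era API; the file path and source format are example assumptions):

```r
library(SparkR)

# Create a DataFrame from a file via read.df() -- the use case reported
# to work on the YARN cluster. Path and source are example values.
sc <- sparkR.init(master = "yarn-client", appName = "read-df-check")
sqlContext <- sparkRSQL.init(sc)
df <- read.df(sqlContext, "examples/src/main/resources/people.json", source = "json")
head(df)
sparkR.stop()
```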
|
I originally planned to have a separate JIRA issue for adding shipping of the SparkR package for the RDD APIs. But if this is still required by the DataFrame API, I can do it in this PR. |
|
The tests pass because the … BTW, the change needed for … is at [1]. [1] https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L103 |
|
I'm curious, have you looked at ways of shipping R itself with the job, or are you relying on R being installed on all the YARN nodes? It should be possible with the distributed cache; just wondering if anyone has done it or looked at making it automatic. |
|
Test build #35989 has finished for PR 6743 at commit
|
|
@tgravescs, I think the problem with shipping R itself is that the R executable is platform-specific. Also, it may require OS-specific installation steps before running R (not sure). PySpark also does not ship Python itself. |
|
Test build #35990 has finished for PR 6743 at commit
|
|
Add support for shipping the SparkR package for the R workers required by the RDD APIs. Tested createDataFrame() by creating a DataFrame from an R list. Removed the sparkRLibDir parameter of sparkR.init(). The SparkR package location is now determined on each worker node according to the deployment mode (this allows a node-specific SPARK_HOME). Not sure if there is a better solution. A rough code scan of pySpark does not tell me how pySpark locates pySpark.zip in the various deployment modes. @davies, could you give me a hint and review this patch? Next, I'd like to refactor this code to align with SPARK-5479 (which moves YARN-specific code from SparkSubmit to deploy/yarn). @shivaram, do you think I should refactor the code in this patch or do it in a new JIRA? |
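A hedged sketch of the two user-visible changes described above (1.4-era SparkR API; not the PR's actual test code):

```r
library(SparkR)

# sparkR.init() no longer takes a sparkRLibDir argument; the SparkR
# package location is resolved on each worker from the deployment mode.
sc <- sparkR.init(master = "yarn-client")
sqlContext <- sparkRSQL.init(sc)

# createDataFrame() from a plain R list exercises the R worker path
# of the RDD API.
df <- createDataFrame(sqlContext, list(list(1L, "a"), list(2L, "b")))
collect(df)
sparkR.stop()
```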
|
rebased |
|
Test build #35994 has finished for PR 6743 at commit
|
|
@sun-rui Unfortunately, I also don't know much about how PySpark runs on YARN. Could you add some unit tests for YARN mode? (Just follow the Python ones.) |
|
@shivaram, yes, I saw that function, but found it confusing that it does not consider the YARN mode case. |
|
cc @andrewor14, who probably knows something about the YARN tests |
|
Test build #36801 has finished for PR 6743 at commit
|
|
Jenkins, retest this please |
|
Test build #36804 has finished for PR 6743 at commit
|
|
Yeah, the R code looks fine to me. However, as this changes some fundamental things about how we locate the package, @sun-rui it would be great if you could confirm the two or three scenarios work correctly (with just a local master or a standalone master is fine).
If things look good, let's merge this once it's rebased + tested. We can do more testing while it's in the tree before 1.5 |
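A minimal local-master smoke test along these lines (a sketch under the 1.4-era SparkR API; not the exact scenarios requested above):

```r
library(SparkR)

# Quick end-to-end check with a local master.
sc <- sparkR.init(master = "local[2]")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, data.frame(x = 1:3, y = c("a", "b", "c")))
stopifnot(count(df) == 3)  # basic round-trip sanity check
sparkR.stop()
```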
|
Test build #37127 has finished for PR 6743 at commit
|
|
@shivaram, tests done. Also tested with YARN cluster mode, yarn-client mode, standalone mode, and createDataFrame() in YARN client mode. |
|
Thanks @sun-rui - LGTM. Merging this |
This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node. This PR uses the JDK jar tool to package the SparkR package, because jar is thought to be available on both Linux/Windows platforms where JDK has been installed.

This PR does not address the R worker involved in the RDD API. Will address it in a separate JIRA issue. This PR does not address the SBT build. SparkR installation and packaging by SBT will be addressed in a separate JIRA issue. R/install-dev.bat is not tested. shivaram, could you help to test it?

Author: Sun Rui <[email protected]>

Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits:

ca63c86 [Sun Rui] Adjust MimaExcludes after rebase.
7313374 [Sun Rui] Fix unit test errors.
72695fb [Sun Rui] Fix unit test failures.
193882f [Sun Rui] Fix Mima test error.
fe25a33 [Sun Rui] Fix Mima test error.
35ecfa3 [Sun Rui] Fix comments.
c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes.
b05340c [Sun Rui] Fix scala style.
2ca5048 [Sun Rui] Fix comments.
1acefd1 [Sun Rui] Fix scala style.
0aa1e97 [Sun Rui] Fix scala style.
41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs.
49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat.
7b916c5 [Sun Rui] Use 'rem' consistently.
3bed438 [Sun Rui] Add a comment.
681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes.
cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.
|
@sun-rui Could you close this PR? It looks like GitHub PRs are not being closed due to an infrastructure issue: https://issues.apache.org/jira/browse/INFRA-9988 |
|
@shivaram, closed the PR.