# [SPARK-15159][SPARKR] SparkR SparkSession API #13635
## Conversation
need to open a JIRA on this.
Created https://issues.apache.org/jira/browse/SPARK-16027 for this
Test build #60384 has finished for PR 13635 at commit
Test build #60385 has finished for PR 13635 at commit
Test build #60388 has finished for PR 13635 at commit
Thanks @felixcheung - I'll take a look at this today. cc @rxin
R/pkg/NAMESPACE
While the naming is accurate, this seems more verbose than before. One option I've been thinking about: what if we broke some backward compatibility and returned a SparkSession object from `sparkR.init`?
The other option is to just call it `sparkR.session` and `sparkR.session.stop`.
@rxin Any thoughts on this?
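For illustration, the `sparkR.session` naming option might be used like this (a sketch only — the exact signatures were still under discussion in this thread):

```r
# Sketch of the shorter naming proposal; names and parameters here
# follow the discussion above, not a settled API.
library(SparkR)

# Create a new SparkSession, or return the existing one.
sparkR.session(master = "local[*]", appName = "example")

df <- createDataFrame(faithful)
head(df)

# Tear down the session and its underlying SparkContext.
sparkR.session.stop()
```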
I agree - I think it's important to have session in there. I thought it would be better to be explicit about the getOrCreate behavior, but as commented, that name is unusual in R.
+1
While SparkSession is the main entry point, SparkContext does remain. My thoughts are:
- Keep `sparkR.init()` as is. It still returns the SparkContext.
- Add new APIs like `sparkRSession.init()` and `sparkRSession.stop()`. `sparkRSession.init()` has two forms:
  A. It can accept a SparkContext as a parameter, and no other Spark configuration.
  B. Just like the current `sparkR.session.getOrCreate()`, it internally creates the SparkContext.
- Keep `sparkRSQL.init()` and `sparkRHive.init()` for backward compatibility, while they are updated to call `sparkRSession.init()`.
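The two proposed forms might look roughly like this (purely illustrative — `sparkRSession.init` is a name from this discussion and was never part of the actual API):

```r
# Form A: accept an existing SparkContext, no other Spark configuration.
sc <- sparkR.init(master = "local[*]", appName = "legacy")
session <- sparkRSession.init(sc)

# Form B: like sparkR.session.getOrCreate(), create the SparkContext
# internally from the supplied configuration.
session <- sparkRSession.init(master = "local[*]", appName = "new",
                              sparkConfig = list(spark.driver.memory = "1g"))
```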
There is very limited use for SparkContext in R - do you really think we should keep/expose it?
Also, most of what you are suggesting is already implemented as such :) I think the exception is that sparkR.init() is deprecated (but still works for backward compatibility).
Yeah, I don't think we have any methods that accept a SparkContext, in fact. So hiding it / not having an API where we can pass a SparkContext around is fine by me.
Thanks @felixcheung for the PR. Other than the naming issues, I think the code changes look pretty good to me. There are some more docs and programming guide changes we'll need to make, but I think we can do them in a follow-up JIRA, given that RC1 is quite close. cc @sun-rui
@shivaram, I'll probably take a look at this tonight.
Would be good to print an info message here that Hive classes are not found and that we are creating a SparkContext without Hive support.
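Something along these lines, perhaps (the surrounding condition is a hypothetical helper, not the exact code in this PR):

```r
# Hypothetical sketch of the suggested fallback message; hiveClassesFound
# stands in for whatever classpath check the PR actually performs.
if (!hiveClassesFound) {
  message("Hive classes are not found in the classpath; ",
          "creating a SparkContext without Hive support.")
}
```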
Updated per feedback + more tests + rebased to master.
Test build #60688 has finished for PR 13635 at commit
R/pkg/R/SQLContext.R
Not related to this PR, but I think we need to deprecate `dropTempTable` and call it `dropTempView` in SparkR as well? cc @liancheng
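Such a rename could keep the old name as a thin deprecated wrapper, e.g. (a sketch assuming a `dropTempView` with the same signature exists; this is not the actual SparkR source):

```r
# Sketch only: keep dropTempTable as a deprecated alias of dropTempView.
dropTempTable <- function(tableName) {
  .Deprecated("dropTempView")    # warn users to migrate to the new name
  dropTempView(tableName)        # delegate to the renamed function
}
```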
+1
+1. There are several API changes related to catalog that we are not changing here either.
Will open a JIRA on this; I have the fix.
@felixcheung Thanks for the update. The change looks pretty good to me. There are 2-3 follow-up JIRAs I opened from the review that can have separate PRs. There was only one comment in shell.R that I think needs to be fixed before merging.
Done. Thanks!
Thanks - could you also bring this up to date with the master branch?
Oops, done.
Test build #60738 has finished for PR 13635 at commit
Test build #60735 has finished for PR 13635 at commit
Tests passed and it merges cleanly.
LGTM. Merging this to master and branch-2.0.
## What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR: `sparkR.session.getOrCreate()` and `sparkR.session.stop()`. "getOrCreate" is a bit unusual in R, but it's important to name this clearly.

The SparkR implementation:
- SparkSession is the main entry point (vs SparkContext, due to the limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both are wrappers around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to the parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs, aka "...") is supported; that should be closer to the Builder syntax in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported; the behavior is the same as in Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because they would be breaking API changes: the `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`
- The `sparkR` shell is updated to use the SparkSession entry point (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entry point
- A bug in `read.jdbc` is fixed

TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding examples
- [ ] Separate PR - update the SparkR programming guide

## How was this patch tested?

unit tests, manual tests

shivaram sun-rui rxin

Author: Felix Cheung <[email protected]>
Author: felixcheung <[email protected]>

Closes #13635 from felixcheung/rsparksession.

(cherry picked from commit 8c198e2)
Signed-off-by: Shivaram Venkataraman <[email protected]>
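For example, the named-parameter ("...") syntax described above avoids the nested builder chain (a sketch following the PR description; the final signature may differ):

```r
# Sketch of the varargs-based session API described above; argument
# names follow the PR description and may differ from the final API.
sparkR.session.getOrCreate(
  master = "local[*]",
  appName = "foo",
  enableHiveSupport = TRUE,
  # extra named "..." arguments become Spark configuration entries,
  # replacing builder chains like config(config(..., "first", "value"), ...)
  spark.sql.shuffle.partitions = "8"
)
```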
awesome!

Oh, great!!
### What changes were proposed in this pull request?

Add back the deprecated R APIs removed by #22843 and #22815. These APIs are:
- `sparkR.init`
- `sparkRSQL.init`
- `sparkRHive.init`
- `registerTempTable`
- `createExternalTable`
- `dropTempTable`

There is no need to port functions such as

```r
createExternalTable <- function(x, ...) {
  dispatchFunc("createExternalTable(tableName, path = NULL, source = NULL, ...)", x, ...)
}
```

because this was for backward compatibility when SQLContext existed (judging from #9192), but it seems we don't need it anymore since SparkR replaced SQLContext with SparkSession in #13635.

### Why are the changes needed?

Amend Spark's Semantic Versioning Policy.

### Does this PR introduce any user-facing change?

Yes. The removed R APIs are put back.

### How was this patch tested?

Added back the removed tests.

Closes #28058 from huaxingao/r.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
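Restoring one of these as a deprecated alias might look like this (illustrative sketch only; the real SparkR implementation differs in detail):

```r
# Sketch only: re-add registerTempTable as a deprecated wrapper around
# its replacement, createOrReplaceTempView.
registerTempTable <- function(x, tableName) {
  .Deprecated(new = "createOrReplaceTempView", old = "registerTempTable")
  createOrReplaceTempView(x, tableName)
}
```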