# [SPARK-15159][SPARKR] SparkR SparkSession API #13635
## Conversation
need to open a JIRA on this.
Created https://issues.apache.org/jira/browse/SPARK-16027 for this
Test build #60384 has finished for PR 13635 at commit
Test build #60385 has finished for PR 13635 at commit
Test build #60388 has finished for PR 13635 at commit
Thanks @felixcheung - I'll take a look at this today. cc @rxin
R/pkg/NAMESPACE
While the naming is accurate, this seems more verbose than before. One option I've been thinking about: what if we broke some backward compatibility and returned a SparkSession object from `sparkR.init`?
The other option is to just call it `sparkR.session` and `sparkR.session.stop`.
@rxin Any thoughts on this?
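For illustration, the `sparkR.session` naming option might be used like this (a sketch only — the exact signatures were still under discussion in this thread):

```r
# Sketch of the shorter naming proposal; names and parameters here
# follow the discussion above, not a settled API.
library(SparkR)

# Create a new SparkSession, or return the existing one.
sparkR.session(master = "local[*]", appName = "example")

df <- createDataFrame(faithful)
head(df)

# Tear down the session and its underlying SparkContext.
sparkR.session.stop()
```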
I agree - I think it's important to have session in there. I thought it would be better to be explicit about the getOrCreate behavior, but as commented, that name is unusual in R.
+1
While SparkSession is the main entry point, SparkContext does remain. My thoughts are:
- Keep `sparkR.init()` as is. It still returns the SparkContext.
- Add new APIs like `sparkRSession.init()` and `sparkRSession.stop()`. `sparkRSession.init()` has two forms:
  A. It can accept a SparkContext as a parameter, and no other Spark configuration.
  B. Just like the current `sparkR.session.getOrCreate()`, it internally creates the SparkContext.
- Keep `sparkRSQL.init()` and `sparkRHive.init()` for backward compatibility, while they are updated to call `sparkRSession.init()`.
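The two proposed forms might look roughly like this (purely illustrative — `sparkRSession.init` is a name from this discussion and was never part of the actual API):

```r
# Form A: accept an existing SparkContext, no other Spark configuration.
sc <- sparkR.init(master = "local[*]", appName = "legacy")
session <- sparkRSession.init(sc)

# Form B: like sparkR.session.getOrCreate(), create the SparkContext
# internally from the supplied configuration.
session <- sparkRSession.init(master = "local[*]", appName = "new",
                              sparkConfig = list(spark.driver.memory = "1g"))
```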
There is very limited use for SparkContext in R - do you really think we should keep/expose it?
Also, most of what you are suggesting is already implemented as such :) I think the exception is that sparkR.init() is deprecated (but still works for backward compatibility).
Yeah, I don't think we have any methods that accept a SparkContext, in fact. So hiding it / not having an API where we can pass a SparkContext around is fine by me.
Thanks @felixcheung for the PR. Other than the naming issues, I think the code changes look pretty good to me. There are some more docs and programming guide changes we'll need to make, but I think we can do them in a follow-up JIRA, given that RC1 is quite close. cc @sun-rui
@shivaram, I'll probably take a look at this tonight.
Would be good to print an info message here that Hive classes are not found and that we are creating a SparkContext without Hive support.
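Something along these lines, perhaps (the surrounding condition is a hypothetical helper, not the exact code in this PR):

```r
# Hypothetical sketch of the suggested fallback message; hiveClassesFound
# stands in for whatever classpath check the PR actually performs.
if (!hiveClassesFound) {
  message("Hive classes are not found in the classpath; ",
          "creating a SparkContext without Hive support.")
}
```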
Updated per feedback + more tests + rebased to master.
Test build #60688 has finished for PR 13635 at commit
R/pkg/R/SQLContext.R
Not related to this PR, but I think we need to deprecate `dropTempTable` and call it `dropTempView` in SparkR as well? cc @liancheng
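Such a rename could keep the old name as a thin deprecated wrapper, e.g. (a sketch assuming a `dropTempView` with the same signature exists; this is not the actual SparkR source):

```r
# Sketch only: keep dropTempTable as a deprecated alias of dropTempView.
dropTempTable <- function(tableName) {
  .Deprecated("dropTempView")    # warn users to migrate to the new name
  dropTempView(tableName)        # delegate to the renamed function
}
```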
+1
+1. There are several API changes related to catalog that we are not changing here either.
Will open a JIRA on this; I have the fix.
@felixcheung Thanks for the update. The change looks pretty good to me. There are 2-3 follow-up JIRAs I opened from the review that can have separate PRs. There was only one comment in shell.R that I think needs to be fixed before merging.
Done. Thanks!
Thanks - could you also bring this up to date with the master branch?
Oops, done.
Test build #60738 has finished for PR 13635 at commit
Test build #60735 has finished for PR 13635 at commit
Tests passed and it merges cleanly.
LGTM. Merging this to master and branch-2.0.
## What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR: `sparkR.session.getOrCreate()` and `sparkR.session.stop()`. "getOrCreate" is a bit unusual in R, but it's important to name this clearly.

The SparkR implementation:
- SparkSession is the main entry point (vs SparkContext, due to the limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both are wrappers around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to the parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs, aka "...") is supported; that should be closer to the Builder syntax in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported; the behavior is the same as in Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because they would be breaking API changes: the `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`
- The `sparkR` shell is updated to use the SparkSession entry point (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entry point
- A bug in `read.jdbc` is fixed

TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding examples
- [ ] Separate PR - update the SparkR programming guide

## How was this patch tested?

unit tests, manual tests

shivaram sun-rui rxin

Author: Felix Cheung <[email protected]>
Author: felixcheung <[email protected]>

Closes #13635 from felixcheung/rsparksession.

(cherry picked from commit 8c198e2)
Signed-off-by: Shivaram Venkataraman <[email protected]>
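For example, the named-parameter ("...") syntax described above avoids the nested builder chain (a sketch following the PR description; the final signature may differ):

```r
# Sketch of the varargs-based session API described above; argument
# names follow the PR description and may differ from the final API.
sparkR.session.getOrCreate(
  master = "local[*]",
  appName = "foo",
  enableHiveSupport = TRUE,
  # extra named "..." arguments become Spark configuration entries,
  # replacing builder chains like config(config(..., "first", "value"), ...)
  spark.sql.shuffle.partitions = "8"
)
```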
awesome!

Oh, great!!
### What changes were proposed in this pull request?

Add back the deprecated R APIs removed by #22843 and #22815. These APIs are:
- `sparkR.init`
- `sparkRSQL.init`
- `sparkRHive.init`
- `registerTempTable`
- `createExternalTable`
- `dropTempTable`

There is no need to port functions such as

```r
createExternalTable <- function(x, ...) {
  dispatchFunc("createExternalTable(tableName, path = NULL, source = NULL, ...)", x, ...)
}
```

because this was for backward compatibility when SQLContext existed (judging from #9192), but it seems we don't need it anymore since SparkR replaced SQLContext with SparkSession in #13635.

### Why are the changes needed?

Amend Spark's Semantic Versioning Policy.

### Does this PR introduce any user-facing change?

Yes. The removed R APIs are put back.

### How was this patch tested?

Added back the removed tests.

Closes #28058 from huaxingao/r.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
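Restoring one of these as a deprecated alias might look like this (illustrative sketch only; the real SparkR implementation differs in detail):

```r
# Sketch only: re-add registerTempTable as a deprecated wrapper around
# its replacement, createOrReplaceTempView.
registerTempTable <- function(x, tableName) {
  .Deprecated(new = "createOrReplaceTempView", old = "registerTempTable")
  createOrReplaceTempView(x, tableName)
}
```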