
Conversation

@felixcheung
Member

@felixcheung felixcheung commented Jun 13, 2016

What changes were proposed in this pull request?

This PR introduces the new SparkSession API for SparkR:
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`

"getOrCreate" is a bit unusual in R but it's important to name this clearly.

SparkR implementation should

  • SparkSession is the main entrypoint (vs SparkContext; due to the limited functionality supported with SparkContext in SparkR)
  • SparkSession replaces SQLContext and HiveContext (both become wrappers around SparkSession; because of API changes, supporting all 3 would be a lot more work)
  • Changes to SparkSession are mostly transparent to users due to SPARK-10903
  • Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (sparkR.init()), but with a deprecation warning
  • Mostly cosmetic changes to the parameter list - users should be able to move to sparkR.session.getOrCreate() easily (see the sketch after this list)
  • An advanced syntax with named parameters (aka varargs, aka "...") is supported; that should be closer to the Builder syntax in Scala/Python (which unfortunately does not work in R, because it would look like this: enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value")))
  • Updating config on an existing SparkSession is supported; the behavior is the same as in Python, where config is applied to both SparkContext and SparkSession
  • Some SparkSession changes are not matched in SparkR, mostly because they would be breaking API changes: the catalog object, createOrReplaceTempView
  • Other SQLContext workarounds are replicated in SparkR, e.g. tables, tableNames
  • The sparkR shell is updated to use the SparkSession entrypoint (sqlContext is removed, just like with Scala/Python)
  • All tests are updated to use the SparkSession entrypoint
  • A bug in read.jdbc is fixed
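
For illustration, a minimal sketch of the intended usage described above (a hedged sketch: the exact parameter names, e.g. enableHiveSupport, are assumptions based on this description rather than the merged signature):

```r
# Deprecated Spark 1.6.x entrypoints, kept working with a deprecation warning:
# sc <- sparkR.init(master = "local[*]", appName = "example")
# sqlContext <- sparkRSQL.init(sc)

# New entrypoint: returns the existing SparkSession or creates one.
# Named parameters ("...") stand in for the Scala/Python builder chain.
spark <- sparkR.session.getOrCreate(
  master = "local[*]",
  appName = "example",
  enableHiveSupport = FALSE,
  spark.sql.shuffle.partitions = "8"  # extra Spark configs as named arguments
)

# Calling it again with new config updates both SparkContext and SparkSession.
spark <- sparkR.session.getOrCreate(spark.sql.shuffle.partitions = "4")

sparkR.session.stop()
```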

TODO

  • Add more tests
  • Separate PR - update all roxygen2 doc coding example
  • Separate PR - update SparkR programming guide

How was this patch tested?

unit tests, manual tests

@shivaram @sun-rui @rxin

Member Author

Need to open a JIRA on this.


@SparkQA

SparkQA commented Jun 13, 2016

Test build #60384 has finished for PR 13635 at commit b494232.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2016

Test build #60385 has finished for PR 13635 at commit a48b7ea.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2016

Test build #60388 has finished for PR 13635 at commit 107dbef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

Thanks @felixcheung - I'll take a look at this today.

cc @rxin

R/pkg/NAMESPACE Outdated
Contributor

While the naming is accurate, this seems more verbose than before. So one option that I've been thinking about is: what if we broke some backward compatibility and returned a SparkSession object from sparkR.init?
The other option is to just call it sparkR.session and sparkR.session.stop.

@rxin Any thoughts on this ?

Member Author

I agree - I think it's important to have "session" in there, and I thought it would be better to be more explicit about the getOrCreate behavior, but as commented that name is unusual in R.

Contributor

+1

Contributor

While SparkSession is the main entrypoint, SparkContext does remain. My thoughts are:

  1. Keep sparkR.init() as is. It still returns the SparkContext.
  2. Add a new API like SparkRSession.init() and sparkRSession.stop().
    SparkRSession.init() has two forms:
    A. It can accept a SparkContext as a parameter, and no other Spark configurations.
    B. Just like the current sparkR.session.getOrCreate(), it internally creates the SparkContext.
  3. Keep sparkRSQL.init() and sparkRHive.init() for backward compatibility, while they are updated to call SparkRSession.init().

Member Author

There is very limited use for SparkContext in R - do you really think we should keep/expose it?
And most of what you are suggesting is already implemented as such :) I think the exception is that sparkR.init() is deprecated (but still works for backward compatibility).

Contributor

Yeah, I don't think we have any methods which accept a SparkContext, in fact. So hiding it / not having an API where we can pass a SparkContext around is fine by me.

@shivaram
Contributor

Thanks @felixcheung for the PR. Other than the naming issues, I think the code changes look pretty good to me. I think there are some more docs and programming guide changes we'll need to make, but I think we can do them in a follow-up JIRA -- given that RC1 is quite close.

cc @sun-rui

@sun-rui
Contributor

sun-rui commented Jun 16, 2016

@shivaram, I'll probably take a look at this tonight.

Contributor

Would be good to print an info message here that Hive classes are not found and that we are creating a SparkContext without Hive support.
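
A rough sketch of the kind of message being suggested (the guard condition and variable name are hypothetical, not from this patch):

```r
# Hypothetical sketch: the guard and its name are illustrative only.
if (!hiveClassesPresent) {
  message("Hive classes are not found; creating a SparkContext without Hive support.")
}
```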

@felixcheung
Member Author

felixcheung commented Jun 17, 2016

updated per feedback + more tests + rebased to master

@SparkQA

SparkQA commented Jun 17, 2016

Test build #60688 has finished for PR 13635 at commit 2ed0788.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Not related to this PR, but I think we need to deprecate dropTempTable and call it dropTempView in SparkR as well? cc @liancheng
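
For reference, a typical R deprecation shim would look something like the sketch below (illustrative only, not code from this PR):

```r
# Hypothetical shim: keep the old name working while steering users to the
# new one. .Deprecated() is base R's standard deprecation mechanism.
dropTempTable <- function(tableName) {
  .Deprecated("dropTempView")
  dropTempView(tableName)
}
```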

Contributor

+1

Member Author

+1. There are several catalog-related API changes that we are not making here, as well.

Member Author

Will open a JIRA on this, I have the fix.

@shivaram
Contributor

@felixcheung Thanks for the update. The change looks pretty good to me. I think there are 2-3 follow-up JIRAs I opened from the review that can have separate PRs. There was only one comment in shell.R that I think needs to be fixed before merging.

@felixcheung
Member Author

Done. Thanks

```
 Welcome to
    ____              __
   / __/__  ___ _____/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
    /_/


 SparkSession available as 'spark'.
> sc
Java ref type org.apache.spark.api.java.JavaSparkContext id 2
```

@shivaram
Contributor

Thanks - Could you also bring this up to date with master branch ?

@felixcheung
Member Author

Oops. Done.

@SparkQA

SparkQA commented Jun 18, 2016

Test build #60738 has finished for PR 13635 at commit 4bc5449.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2016

Test build #60735 has finished for PR 13635 at commit b6b8712.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@felixcheung
Member Author

Tests passed and it merges cleanly.

@shivaram
Contributor

LGTM. Merging this to master and branch-2.0

@asfgit asfgit closed this in 8c198e2 Jun 18, 2016
asfgit pushed a commit that referenced this pull request Jun 18, 2016
Author: Felix Cheung <[email protected]>
Author: felixcheung <[email protected]>

Closes #13635 from felixcheung/rsparksession.

(cherry picked from commit 8c198e2)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@felixcheung
Member Author

awesome!

@dongjoon-hyun
Member

Oh, great!!

HyukjinKwon pushed a commit that referenced this pull request Apr 1, 2020
### What changes were proposed in this pull request?
Add back the deprecated R APIs removed by #22843 and #22815.

These APIs are

- `sparkR.init`
- `sparkRSQL.init`
- `sparkRHive.init`
- `registerTempTable`
- `createExternalTable`
- `dropTempTable`

There is no need to port functions such as
```r
createExternalTable <- function(x, ...) {
  # dispatchFunc supported the deprecated calling style that passed a
  # SQLContext as the first argument.
  dispatchFunc("createExternalTable(tableName, path = NULL, source = NULL, ...)", x, ...)
}
```
because this was for backward compatibility from when SQLContext existed (presumably added in #9192), but it seems we don't need it anymore since SparkR replaced SQLContext with SparkSession in #13635.

### Why are the changes needed?
Amend Spark's Semantic Versioning Policy

### Does this PR introduce any user-facing change?
Yes
The removed R APIs are put back.

### How was this patch tested?
Add back the removed tests

Closes #28058 from huaxingao/r.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Apr 1, 2020
(cherry picked from commit fd0b228)
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020