[SPARKR] [SPARK-11199] Improve R context management story and add getOrCreate #9185
Conversation
Test build #44006 has finished for PR 9185 at commit
cc @davies
@falaki SQLContext.getOrCreate could return a HiveContext; it's slightly different than a plain SQLContext.
HiveContext is a subclass of SQLContext, so all SQLContext functionality continues to work. cc @marmbrus, what do you think?
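A minimal Scala sketch of the behavior under discussion, assuming the Spark 1.5/1.6 APIs (SQLContext.getOrCreate and HiveContext); the local-mode setup is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))

// Constructing a HiveContext registers it as the last-instantiated context.
val hiveCtx = new HiveContext(sc)

// getOrCreate returns the existing context instead of building a new one,
// so here it hands back the HiveContext created above.
val ctx = SQLContext.getOrCreate(sc)
assert(ctx eq hiveCtx)

// Since HiveContext is a subclass of SQLContext, plain SQLContext
// functionality keeps working on the returned context.
ctx.range(5).count()
```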
On the R side, there is a cache of the created SQLContext/HiveContext, so R won't call createSQLContext() a second time. See https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L218 and https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L249. Also, SPARK-11042 prevents users from creating multiple root SQLContexts (when multiple root SQLContexts are not allowed), so there is no need to change createSQLContext(). However, I am not sure whether we will support sessions in SparkR. If so, it would make sense to make this change, since getOrCreate() will return the active SQLContext for the current thread if one exists, before falling back to the root SQLContext/HiveContext. @davies, I am curious why you implemented createSQLContext() in Scala as a helper function for SparkR to create a SQLContext. It seems SparkR could directly use newJObject("org.apache.spark.sql.SQLContext", sc) to create a SQLContext, just as it does when creating a HiveContext.
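For context, a Scala sketch of how the backend helper mentioned above could delegate to getOrCreate; the object name and signature follow the createSQLContext() helper referenced in this thread, but the exact shape here is an assumption, not the merged code:

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SQLContext

// Sketch of the R backend helper (shape assumed): instead of always
// constructing a new SQLContext, delegate to getOrCreate so repeated calls
// from R get the already-created context (possibly a HiveContext).
object SQLUtils {
  def createSQLContext(jsc: JavaSparkContext): SQLContext =
    SQLContext.getOrCreate(jsc.sc)
}
```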
I vote for simplicity for SparkR and not having multiple sessions.
@falaki HiveContext has more functionality than SQLContext (window functions, ORC files, etc.) and a few semantic differences (how decimals and intervals are parsed). Usually a user will only want one of them, so the created one should be the one the user also wants in SparkR. I think this change is fine. @sun-rui That change is not necessary; I may not have figured out a better way to do it.
We could merge this after it passes tests.
Test build #1957 has finished for PR 9185 at commit
@rxin can you comment on how important sessions are (especially with respect to SparkR)? If they are not important, we can significantly simplify things and just support one SQLContext (or HiveContext, if that's what is used) for a SparkR session.
Sessions are critical; they are actually one of the most important features for Spark 1.6.
Is there an example of how people use sessions? Or rather, can @davies or you describe the API used to support sessions in Scala / Python?
Sessions allow multiple users to share a Spark SQL cluster without clobbering each other. Imagine you have multiple R sessions connected to the same Spark cluster. If one user runs something that changes session state (say, registering a temp table), it should not affect the other users. One question is whether a single instance of SparkR needs to be able to have more than one session (forgive me if this doesn't make sense, as I'm super unfamiliar with our R architecture). I think the most important thing is that it's possible to inject an isolated session into SparkR, not that a single instance of SparkR can have more than one session.
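To make the isolation concrete, a short Scala sketch assuming the newSession() API from Spark 1.6 (sc is an existing SparkContext); sessions share the SparkContext and cached data but keep separate temp tables, SQL conf, and UDFs:

```scala
import org.apache.spark.sql.SQLContext

// Two sessions share the same SparkContext but have isolated state.
val session1 = SQLContext.getOrCreate(sc)
val session2 = session1.newSession()

// A temp table registered in one session is invisible in the other,
// so concurrent users cannot clobber each other's names.
session1.range(10).registerTempTable("nums")
assert(session1.tableNames().contains("nums"))
assert(!session2.tableNames().contains("nums"))
```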
From the user's point of view, multiple concurrent R sessions are expected, to allow parallel analysis when SparkR is running as a service in the cloud. However, R at its core is single-threaded and does not support the concept of a session. So there exist intermediate layers that enable multiple R sessions, for example RStudio Server Pro and Rserve; both of them enable multiple R sessions by spawning multiple R processes. So my point is that within SparkR we don't need to support SQL sessions, and a single SQLContext is enough.
Thanks @marmbrus, that helps. Yeah, so I think from my perspective it makes sense to just have one active session for SparkR and thus just one active SQLContext. This means that the change @felixcheung was working on to hide ... We should however add support for two things: (a) specifying a session in ...
I would propose that sparkR.init just has a flag that says whether sessions should be isolated or not. When it connects, it can either call ...
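A hypothetical Scala sketch of that proposal; the helper name, the flag, and the use of newSession() are assumptions filled in for illustration, not the merged implementation:

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical helper for the flag proposed above: share the root context
// by default, or hand back an isolated session when requested.
def getOrCreateSQLContext(jsc: JavaSparkContext, isolated: Boolean): SQLContext = {
  val root = SQLContext.getOrCreate(jsc.sc)
  if (isolated) root.newSession() else root
}
```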
@shivaram I did plan on changing ...
/cc @davies
R does not support threading, so it is reasonable that SparkR does not support multiple sessions in the same R process at the same time, as we move toward making the SQLContext a singleton in SparkR. Consider two cases: ...
@davies, I don't understand your two cases. SparkR is actually a standalone Spark application in R; the JVM backend is dedicated and won't be shared with other applications. SparkR is not a thrift client to a Spark SQL service, and it won't be a session of a Spark SQL thrift service, so it makes no sense to add a flag about sessions to sparkR.init. SparkR itself won't try to provide multi-session support (as the Spark SQL thrift server does) because R is a single-threaded environment. So my point is that a single root SQLContext is what we want.
ping @marmbrus
This seems fine to me as a first step. Eventually we will probably want to make the RBackend multi-session aware.
test this please
As pointed out above, the R code actually does not call ...; see https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L243
test this please
Test build #48386 has finished for PR 9185 at commit
LGTM
Merging to master |
Use SQLContext.getOrCreate instead of creating a new context. [SPARK-11199]