[SPARKR] [SPARK-11199] Improve R context management story and add getOrCreate #9185
Conversation
Test build #44006 has finished for PR 9185 at commit
cc @davies
@falaki SQLContext.getOrCreate could return a HiveContext; it's slightly different than a plain SQLContext.
HiveContext is a subclass of SQLContext, so all SQLContext functionality continues to work. cc @marmbrus, what do you think?
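A minimal Scala sketch of the behavior under discussion, assuming the Spark 1.5/1.6 APIs (SQLContext.getOrCreate and HiveContext); the local-mode setup is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))

// Constructing a HiveContext registers it as the last-instantiated context.
val hiveCtx = new HiveContext(sc)

// getOrCreate returns the existing context instead of building a new one,
// so here it hands back the HiveContext created above.
val ctx = SQLContext.getOrCreate(sc)
assert(ctx eq hiveCtx)

// Since HiveContext is a subclass of SQLContext, plain SQLContext
// functionality keeps working on the returned context.
ctx.range(5).count()
```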
On the R side, there is a cache of the created SQLContext/HiveContext, so R won't call createSQLContext() a second time. See https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L218 and https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L249. Also, SPARK-11042 prevents users from creating multiple root SQLContexts (when multiple root SQLContexts are not allowed), so there is no need to change createSQLContext(). However, I am not sure whether we will support sessions in SparkR. If so, it would make sense to make this change, since getOrCreate() will return the active SQLContext for the current thread if one exists, before falling back to the root SQLContext/HiveContext. @davies, I am curious why you implemented createSQLContext() in Scala as a helper function for SparkR to create a SQLContext. It seems SparkR could directly use newJObject("org.apache.spark.sql.SQLContext", sc) to create a SQLContext, just as it does when creating a HiveContext.
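For context, a Scala sketch of how the backend helper mentioned above could delegate to getOrCreate; the object name and signature follow the createSQLContext() helper referenced in this thread, but the exact shape here is an assumption, not the merged code:

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SQLContext

// Sketch of the R backend helper (shape assumed): instead of always
// constructing a new SQLContext, delegate to getOrCreate so repeated calls
// from R get the already-created context (possibly a HiveContext).
object SQLUtils {
  def createSQLContext(jsc: JavaSparkContext): SQLContext =
    SQLContext.getOrCreate(jsc.sc)
}
```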
I vote for simplicity for SparkR and not having multiple sessions.
@falaki HiveContext has more functionality than SQLContext (window functions, ORC files, etc.) and a few semantic differences (how decimals and intervals are parsed). Usually a user will only want one of them, so the created one should be the one the user also wants in SparkR. I think this change is fine. @sun-rui That change is not necessary; I may not have figured out a better way to do it.
We could merge this after it passes tests.
Test build #1957 has finished for PR 9185 at commit
@rxin can you comment on how important sessions are (especially with respect to SparkR)? If they are not important, we can significantly simplify things and just support one SQLContext (or HiveContext, if that's what is used) for a SparkR session.
Sessions are critical; they are actually one of the most important features for Spark 1.6.
Is there an example of how people use sessions? Or rather, can @davies or you describe the API used to support sessions in Scala / Python?
Sessions allow multiple users to share a Spark SQL cluster without clobbering each other. Imagine you have multiple R sessions connected to the same Spark cluster. If one user runs something that changes session state (say, registering a temp table), it should not affect the other users. One question is whether a single instance of SparkR needs to be able to have more than one session (forgive me if this doesn't make sense, as I'm super unfamiliar with our R architecture). I think the most important thing is that it's possible to inject an isolated session into SparkR, not that a single instance of SparkR can have more than one session.
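To make the isolation concrete, a short Scala sketch assuming the newSession() API from Spark 1.6 (sc is an existing SparkContext); sessions share the SparkContext and cached data but keep separate temp tables, SQL conf, and UDFs:

```scala
import org.apache.spark.sql.SQLContext

// Two sessions share the same SparkContext but have isolated state.
val session1 = SQLContext.getOrCreate(sc)
val session2 = session1.newSession()

// A temp table registered in one session is invisible in the other,
// so concurrent users cannot clobber each other's names.
session1.range(10).registerTempTable("nums")
assert(session1.tableNames().contains("nums"))
assert(!session2.tableNames().contains("nums"))
```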
From the user's point of view, multiple concurrent R sessions are expected, to allow parallel analysis when SparkR is running as a service in the cloud. However, R at its core is single-threaded and does not support the concept of a session. So there exist intermediate layers that enable multiple R sessions, for example RStudio Server Pro and Rserve; both of them enable multiple R sessions by spawning multiple R processes. So my point is that within SparkR we don't need to support SQL sessions, and a single SQLContext is enough.
Thanks @marmbrus, that helps. Yeah, so I think from my perspective it makes sense to just have one active session for SparkR and thus just one active SQLContext. This means that the change @felixcheung was working on to hide ... We should however add support for two things: (a) specifying a session in ...
I would propose that sparkR.init just has a flag that says whether sessions should be isolated or not. When it connects, it can either call ...
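A hypothetical Scala sketch of that proposal; the helper name, the flag, and the use of newSession() are assumptions filled in for illustration, not the merged implementation:

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical helper for the flag proposed above: share the root context
// by default, or hand back an isolated session when requested.
def getOrCreateSQLContext(jsc: JavaSparkContext, isolated: Boolean): SQLContext = {
  val root = SQLContext.getOrCreate(jsc.sc)
  if (isolated) root.newSession() else root
}
```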
@shivaram I did plan on changing ...
/cc @davies
R does not support threading, so it is reasonable that SparkR does not support multiple sessions in the same R process at the same time, as we move toward making the SQLContext a singleton in SparkR. Consider two cases: ...
@davies, I don't understand your two cases. SparkR is actually a standalone Spark application in R; the JVM backend is dedicated and won't be shared with other applications. SparkR is not a thrift client to a Spark SQL service, and it won't be a session of a Spark SQL thrift service, so it makes no sense to add a flag about sessions to sparkR.init. SparkR itself won't try to provide multi-session support (as the Spark SQL thrift server does) because R is a single-threaded environment. So my point is that a single root SQLContext is what we want.
ping @marmbrus
This seems fine to me as a first step. Eventually we will probably want to make the RBackend multi-session aware.
test this please
As pointed out above, the R code actually does not call ...; see https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L243
test this please
Test build #48386 has finished for PR 9185 at commit
LGTM
Merging to master |
Use SQLContext.getOrCreate instead of creating a new context. [SPARK-11199]