Conversation

@falaki
Contributor

falaki commented Oct 20, 2015

  • Changes api.r.SQLUtils to use SQLContext.getOrCreate instead of creating a new context.
  • Adds a simple test

JIRA: SPARK-11199
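
For reference, the described change amounts to having the SparkR helper in api.r.SQLUtils delegate to SQLContext.getOrCreate instead of constructing a context itself. A minimal Scala sketch of that idea (not the exact patch):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SQLUtilsSketch {
  // Called from the R side through the JVM backend.
  def createSQLContext(sc: SparkContext): SQLContext = {
    // getOrCreate returns the existing singleton SQLContext registered for
    // this JVM if there is one, and creates and registers one otherwise.
    SQLContext.getOrCreate(sc)
  }
}
```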

@SparkQA

SparkQA commented Oct 20, 2015

Test build #44006 has finished for PR 9185 at commit 58fff3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Oct 21, 2015

cc @davies

@davies
Contributor

davies commented Oct 21, 2015

@falaki SQLContext.getOrCreate could return a HiveContext, which is slightly different from creating a new SQLContext. Is this what we want?

@falaki
Contributor Author

falaki commented Oct 22, 2015

HiveContext is a subclass of SQLContext, so all of SQLContext's functionality continues to work. cc @marmbrus, what do you think?
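
To make the subtype point concrete, here is a small illustration (the app name and local master URL are arbitrary, not from the PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SubtypeIllustration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("illustration").setMaster("local[2]"))
    // The static type is SQLContext; the runtime instance may be a HiveContext
    // if one was registered first, and all SQLContext methods still apply.
    val ctx: SQLContext = SQLContext.getOrCreate(sc)
    println(ctx.getClass.getName)
    println(ctx.range(5).count())  // ordinary SQLContext API works regardless
    sc.stop()
  }
}
```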

@sun-rui
Contributor

sun-rui commented Oct 22, 2015

On the R side, there is a cache of the created SQLContext/HiveContext, so R won't call createSQLContext() a second time. See https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L218 and https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L249. Also, SPARK-11042 prevents users from creating multiple root SQLContexts (if multiple root SQLContexts are not allowed). So there is no need to change createSQLContext().

However, I am not sure whether we will support sessions in SparkR. If so, this change would make sense, since getOrCreate() returns the active SQLContext for the current thread, if one exists, before falling back to the root SQLContext/HiveContext.

@davies, I am curious why you implemented createSQLContext() in Scala as a helper function for SparkR to create a SQLContext. It seems SparkR could directly use newJObject("org.apache.spark.sql.SQLContext", sc) to create a SQLContext, just as it does when creating a HiveContext.

@felixcheung
Member

I vote for simplicity for SparkR and not having multiple sessions.
In fact, I observe it is already messy to handle a DataFrame created by a different SparkContext (after stop() and init()). I would argue these concepts do not translate well to R, where for the most part 'session' == 'process'.

@shivaram
Contributor

@davies @rxin Any updates on this?

@davies
Contributor

davies commented Oct 27, 2015

@falaki HiveContext has more functionality than SQLContext (window functions, ORC files, etc.) and a few semantic differences (how decimals and intervals are parsed). Usually a user will only want one of them, so the context that gets created should be the one the user also wants in SparkR. I think this change is fine.

@sun-rui That change is not necessary; I may just not have figured out the better way to do it at the time.

@davies
Contributor

davies commented Oct 27, 2015

We could merge this after it passes tests.

@SparkQA

SparkQA commented Oct 27, 2015

Test build #1957 has finished for PR 9185 at commit 58fff3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@rxin can you comment on how important sessions are (especially with respect to SparkR)? If they are not important, we can significantly simplify things and just support one SQLContext (or HiveContext, if that's what is used) for a SparkR session.

@rxin
Contributor

rxin commented Nov 4, 2015

Sessions are critical; they are actually one of the most important features for Spark 1.6.

@shivaram
Contributor

shivaram commented Nov 4, 2015

Is there an example of how people use sessions? Or rather, can @davies or you describe the API used to support sessions in Scala / Python?

@marmbrus
Contributor

marmbrus commented Nov 5, 2015

Sessions allow multiple users to share a Spark SQL cluster without clobbering each other. Imagine you have multiple R sessions connected to the same Spark cluster: if one user runs sql("use database foo"), it's going to break all the other users that are connected.

One question is whether a single instance of SparkR needs to be able to have more than one session (forgive me if this doesn't make sense, as I'm super unfamiliar with our R architecture). I think the most important thing is that it's possible to inject an isolated session into SparkR, not that a single instance of SparkR can have more than one session.
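
For context, here is roughly what session isolation looks like on the Scala side (a local sketch, not SparkR code; the table name and data are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SessionIsolationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sessions").setMaster("local[2]"))
    val root = SQLContext.getOrCreate(sc)
    // newSession shares the SparkContext and cached data with the root context
    // but keeps its own SQL configuration, temporary tables and UDFs, which is
    // what keeps concurrent users from clobbering each other.
    val isolated = root.newSession()
    root.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name").registerTempTable("people")
    println(root.tableNames().contains("people"))      // true
    println(isolated.tableNames().contains("people"))  // false: temp tables are per session
    sc.stop()
  }
}
```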

@sun-rui
Contributor

sun-rui commented Nov 6, 2015

From the user's point of view, multiple concurrent R sessions are expected, to allow parallel analysis when SparkR runs as a service in the cloud.

However, R at its core is single threaded and does not support the concept of a session. So there are intermediate layers that enable multiple R sessions, for example RStudio Server Pro and Rserve; both of them do so by spawning multiple R processes.

So my point is that within SparkR we don't need to support SQL sessions, and a single SQLContext is enough.

@shivaram
Contributor

shivaram commented Nov 6, 2015

Thanks @marmbrus, that helps. From my perspective it makes sense to just have one active session for SparkR, and thus just one active SQLContext. This means that the change @felixcheung was working on to hide sqlContext from users would still be required.

We should however add support for two things: (a) specifying a session in sparkR.init (are sessions just named as strings?) and (b) a command to switch sessions, which would change the sqlContext and all the necessary data structures under the covers so that subsequent commands run with the new session.

@marmbrus
Contributor

marmbrus commented Nov 6, 2015

I would propose that sparkR.init just take a flag that says whether sessions should be isolated or not. When it connects, it can either call SQLContext.getOrCreate(sc).newSession() or just SQLContext.getOrCreate(sc), based on that flag.
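
Concretely, the backend decision that flag drives would be something like the following sketch (the helper name and flag are hypothetical, just to show the two calls side by side):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SessionChoiceSketch {
  // Hypothetical helper: pick between the shared root context and an
  // isolated session based on a flag passed down from sparkR.init.
  def sqlContextFor(sc: SparkContext, isolatedSession: Boolean): SQLContext = {
    val root = SQLContext.getOrCreate(sc)
    if (isolatedSession) root.newSession() else root
  }
}
```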

@felixcheung
Member

@shivaram I did plan on changing sparkR.init() and sparkRSQL.init().
I like the idea @marmbrus proposed: we could have sparkR.init(new.sql.session = TRUE) call SQLContext.getOrCreate(sc).newSession(), and we could offer the option of keeping the same SparkContext or stopping and creating a new one from a simple API call.

@marmbrus
Contributor

marmbrus commented Nov 6, 2015

/cc @davies

@davies
Contributor

davies commented Nov 6, 2015

R does not support threading, so it's reasonable that SparkR does not support multiple sessions in the same R process at the same time, as we move to make the SQLContext a singleton in SparkR.

Consider two cases:

  1. The JVM is launched by the R process; then calling new SQLContext or SQLContext.getOrCreate gives the same result.
  2. The JVM is launched first, then the R process is launched and the information about the R backend is passed into it. Should SparkR share the same SQL session with other applications in the JVM or not? Here @marmbrus's suggestion sounds like the better approach, giving users the ability to choose what they want.

@sun-rui
Contributor

sun-rui commented Nov 10, 2015

@davies, I don't understand your two cases. SparkR is actually a standalone Spark application in R; the JVM backend is dedicated and is not shared with other applications.

SparkR is not a thrift client to a Spark SQL service, and it won't be a session of a Spark SQL thrift service, so it makes no sense to add a flag about sessions to sparkR.init.

SparkR itself won't try to provide multi-session support (as the Spark SQL thrift server does), because R is a single-threaded environment.

So my point is that a single root SQLContext is what we want.

@falaki
Contributor Author

falaki commented Dec 28, 2015

ping @marmbrus

@marmbrus
Contributor

This seems fine to me as a first step. Eventually we will probably want to make the RBackend multi-session aware.

@yhuai
Contributor

yhuai commented Dec 29, 2015

test this please

@felixcheung
Member

As pointed out above, the R code actually does not call createSQLContext multiple times:

https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L243

sparkRSQL.init <- function(jsc = NULL) {
  # Return the cached SQLContext if one was already created in this session
  if (exists(".sparkRSQLsc", envir = .sparkREnv)) {
    return(get(".sparkRSQLsc", envir = .sparkREnv))
  }
...
  # Otherwise create one through the JVM backend and cache it
  sqlContext <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
                            "createSQLContext",
                            sc)
  assign(".sparkRSQLsc", sqlContext, envir = .sparkREnv)
  sqlContext
}

@yhuai
Contributor

yhuai commented Dec 29, 2015

test this please

@SparkQA

SparkQA commented Dec 29, 2015

Test build #48386 has finished for PR 9185 at commit 0633a73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

LGTM

@yhuai
Contributor

yhuai commented Dec 29, 2015

Merging to master

asfgit closed this in f6ecf14 on Dec 29, 2015