[SPARK-2087] [SQL] Multiple thriftserver sessions with different HiveContext instances #4382

chenghao-intel · 2015-02-05T03:11:10Z

Within thriftserver mode, only a single HiveContext instance for the process, and it leads to sqlconfobject shared across multiple user sessions.
In order to isolate the sqlconf for each user session, we create new HiveContext instance for each of the user session. However, we also want keep the existed logic like multiple users will share the same catalog and cache, hence I pull out the catalog, cachemanager as well as functionRegistry as global unique instance.

Cross HiveContext catalog, cache management, function, view etc. (Covered by this PR)
Context wide scope for temporal table / function / view, multiple users will not share with each other
Cross processes catalog, cache management etc. (manage the cached data in external cache layer, e.g. based on Tachyon)

SparkQA · 2015-02-05T03:12:42Z

Test build #26816 has started for PR 4382 at commit 403d6ec.

This patch merges cleanly.

chenghao-intel · 2015-02-05T03:15:33Z

@liancheng Do you have any idea how to collect the thrift sever logs in the unit test? It says timeout exception, and I believe either port error or the server process exited due to some error (like the create the hive metastore client failure).

guowei2 · 2015-02-05T04:15:40Z

@chenghao-intel
I think there is no need to make HiveContext as ThreadLocal. and SessionState ThreadLocal is enough.
https://github.com/guowei2/spark/compare/SPARK-4815?expand=1
do we fix the same issue?

chenghao-intel · 2015-02-05T04:26:24Z

@guowei2 Probably no. HiveContext has it's own internal metastore (temp metastore), SQLConf instance etc. I don't think multiple users want to share those info when they connect the same thriftserver. SessionState actually wraps the internal state as the local thread internally, hence I didn't change that in my PR.

SparkQA · 2015-02-05T04:31:47Z

Test build #26816 has finished for PR 4382 at commit 403d6ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-05T04:31:50Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26816/
Test PASSed.

chenghao-intel · 2015-02-05T04:35:21Z

Seems HiveThriftServer2Suite didn't run, is it disabled by default?

mallman · 2015-02-05T16:53:40Z

FWIW I'd like to add my two cents. The main piece of functionality the installation at my company would benefit from is independent user sessions. I'm not familiar enough with the source to say exactly what that means in terms of a source patch, but one of the key use cases is the ability to set the session default database ("use ") and SQLConf settings independent of other beeline connections. Right now, setting the database sets it across all connections and that is a major impediment to wider use of a shared thriftserver.

Cheers!

chenghao-intel · 2015-02-06T06:00:24Z

@mallman this PR exactly aims to fix the bug you mentioned, and it passed the tested in my local machine. However, I am still figuring out some of the unit testing failures, hopefully I can update the title by removing the "WIP" soon.

chenghao-intel · 2015-02-06T07:19:20Z

@liancheng Seems HiveThriftServer2Suite didn't run, is it disabled by default?

liancheng · 2015-02-10T00:24:37Z

@chenghao-intel This is just fixed by #4486.

chenghao-intel · 2015-02-10T00:25:20Z

Should I retest this?

chenghao-intel · 2015-02-10T00:28:22Z

test this please.

liancheng · 2015-02-10T00:29:06Z

HiveThriftServer2Suite timeout is also fixed, please refer to #4484.

liancheng · 2015-02-10T00:29:32Z

Yeah, you should retest.

SparkQA · 2015-02-10T00:33:06Z

Test build #27143 has started for PR 4382 at commit 403d6ec.

This patch merges cleanly.

SparkQA · 2015-02-10T01:44:26Z

Test build #27143 has finished for PR 4382 at commit 403d6ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-10T01:44:30Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27143/
Test PASSed.

chenghao-intel · 2015-02-10T01:50:46Z

@liancheng Seems the HiveThriftServer2Suite still not be triggered.
test this please.

liancheng · 2015-02-10T02:00:07Z

retest this please

SparkQA · 2015-02-10T02:02:58Z

Test build #27158 has started for PR 4382 at commit 403d6ec.

This patch merges cleanly.

SparkQA · 2015-02-10T03:21:25Z

Test build #27158 has finished for PR 4382 at commit 403d6ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-10T03:21:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27158/
Test PASSed.

liancheng · 2015-02-10T19:05:47Z

@chenghao-intel The biggest issue I see in this PR is that users can no longer share cached tables, which breaks many existing use scenarios. I'm afraid it may require major efforts to support both multi-session and cache sharing. I can think of two alternatives:

All sessions share a single HiveContext to enable cache sharing, and provide one SQLConf per session to isolate session configurations.
Provide one HiveContext per session to isolate session configuration, but extract cached table meta data management out of SQLContext / HiveContext to some global place, so that different HiveContext can figure out the mapping between cached tables (actually query plans) and their underlying materialized in-memory columnar RDDs).

chenghao-intel · 2015-02-11T00:25:53Z

Yeah, I can understand people want to share the cached table among multi-sessions, is there any potential requirement that people just want to keep the 'temp' table visibility within the HiveContext (will not share with the other session)?
Probably the option 2 is a better if we want to support the feature above in the future.
Anyway, thanks for reviewing, I will update the code.

guowei2 · 2015-02-12T01:58:29Z

I agree with liancheng, user want to share cached tables in many cases.
I still think isolate all resources in HiveContext is not a good idea.
Whether to use cached table created by different user is the responsibility the authority does.
we should not always keep them invisibility
I still think SessionState ThreadLocal is enough. and hiveconf can get from SessionState
This is what hive does with thriftServer

SparkQA · 2015-02-12T05:52:26Z

Test build #27326 has started for PR 4382 at commit 73922ae.

This patch merges cleanly.

chenghao-intel · 2015-02-12T06:31:18Z

I agree with @liancheng, too, both code and description are updated.

SparkQA · 2015-02-12T06:40:50Z

Test build #27326 has finished for PR 4382 at commit 73922ae.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait DataFrame extends RDDApi[Row] with Serializable

AmplabJenkins · 2015-02-12T06:40:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27326/
Test FAILed.

chenghao-intel · 2015-02-12T06:59:08Z

Seems failure due to the irrelevant code.

chenghao-intel · 2015-02-12T06:59:14Z

retest this please

SparkQA · 2015-02-12T07:02:26Z

Test build #27328 has started for PR 4382 at commit 73922ae.

This patch merges cleanly.

SparkQA · 2015-02-12T07:51:40Z

Test build #27328 has finished for PR 4382 at commit 73922ae.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-12T07:51:44Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27328/
Test FAILed.

SparkQA · 2015-02-15T02:47:32Z

Test build #27501 has started for PR 4382 at commit 9d3b296.

This patch merges cleanly.

SparkQA · 2015-02-15T03:37:08Z

Test build #27501 has finished for PR 4382 at commit 9d3b296.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-15T03:37:11Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27501/
Test FAILed.

SparkQA · 2015-02-22T04:22:38Z

Test build #27830 has started for PR 4382 at commit 197e806.

This patch merges cleanly.

SparkQA · 2015-02-22T05:34:37Z

Test build #27830 has finished for PR 4382 at commit 197e806.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-22T05:34:40Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27830/
Test PASSed.

chenghao-intel · 2015-02-23T12:47:12Z

/cc @liancheng can you review this for me?

chenghao-intel · 2015-02-25T00:53:59Z

/cc @liancheng @marmbrus can you give some high level comments? and then I can start the rebasing (again)

liancheng · 2015-02-25T05:34:35Z

Hey @chenghao-intel, thanks for working on this, AFAIK this is a pain point for many Spark SQL users who would like to put HiveThriftServer2 into production. Also had a discussion with @marmbrus about this recently.

As we've discussed offline, instead of changing CacheManager and FunctionRegistry to global instances, adding a SQLSession and moving per-session objects (configurations, temporary functions, etc.) to it could be more preferrable. To be more specific:

Add a new SQLSession class, which is responsible to maintain all per-session objects, like configurations, temporary functions, etc..
Add a session field of type SQLSession in SQLContext, and override it in HiveContext, then put Hive specific per-session objects into it, like Hive client, Hive session state, etc..
Add the following session specific methods to SQLContext:
- createSession: SQLSession
- currentSession: SQLSession
- setSession(session: SQLSession): Unit
- closeSession(session: SQLSession)
These methods should be private[sql] as they are subject to change. Currently we can just mimic Hive behavior, for example, using thread-local instances just like what Hive session state does. (You may see the SQLSession object within HiveContext a thin wrapper of SessionState together with other per-session components.)

The benefits of this approach are:

In the long run, we'd like to move HiveContext out of the main framework and make Hive a separate data source. With the above approach, it's more natural to build a separate Spark SQL server with multi-user support without depending on Hive specific code.
Making components like CacheManager global objects are not test friendly. Basically it's impossible to make Spark SQL tests run in parallel.

chenghao-intel · 2015-02-25T07:55:31Z

@liancheng thank you very much for the so detailed comment! Actually I am quite fighting with the 2 approaches:
Single HiveContext Instance (with the thread local of SQLSession) V.S. Multiple HiveContext Instances(with global CacheManager and Hive Metastore instance etc.)

The main reasons I choice the later approach is people can create arbitrary number of HiveContext or SQLContext instances when using DF API, and then they may reference the created instances within the same thread, in ThreadLocal(former one) approach, it still will causes unpredictable behavior, unless we make the HiveContext or SQLContext as global unique instance, but this is not we want, right?

Whatever approach we take, it's intuitive that the CacheManager, HiveMetaStore, FunctionRegistry can be Context independent and thread-safe, and we can put the Context dependent information into SQLContext or HiveContext as property, like the sqlconf and sessionState.

What do you think?

tianyi · 2015-03-03T02:16:07Z

@chenghao-intel I'm little confusing about why people have to create multiple HiveContext or SQLContext instances in the same thread. Could u provide some cases?

liancheng · 2015-03-03T07:50:07Z

@chenghao-intel I'm posting the summary of our offline discussion here for future reference:

SQLContext.CacheManager maps to SparkContext.CacheManager in a one-to-one manner. Once the SQL cache manager is made a global object, users can only use a single SparkContext in a single process. This limitation is neither wanted nor necessary. On the other hand, moving session specific stuff to SQLSession and leaving stuff shared by multiple sessions in SQLContext/HiveContext avoids this problem.

SparkQA · 2015-03-09T02:37:46Z

Test build #28377 has started for PR 4382 at commit 197e806.

This patch does not merge cleanly.

SparkQA · 2015-03-09T04:16:26Z

Test build #28377 has finished for PR 4382 at commit 197e806.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-09T04:16:30Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28377/
Test PASSed.

chenghao-intel · 2015-03-17T00:19:05Z

Closing it since #4885 has been merged.

chenghao-intel changed the title ~~[SPARK-2087] [SQL] [WIP] Multiple thriftserver sessions with different HiveContext instances~~ [SPARK-2087] [SQL] Multiple thriftserver sessions with different HiveContext instances Feb 7, 2015

chenghao-intel force-pushed the multisessions branch from 403d6ec to 73922ae Compare February 12, 2015 05:51

chenghao-intel force-pushed the multisessions branch from 73922ae to 9d3b296 Compare February 15, 2015 02:44

chenghao-intel mentioned this pull request Feb 15, 2015

[SPARK-5825] [Spark Submit] Remove the double checking instance name when stopping the service #4611

Closed

Support the multiple sessions for ThriftServer

197e806

chenghao-intel force-pushed the multisessions branch from 9d3b296 to 197e806 Compare February 22, 2015 04:20

chenghao-intel mentioned this pull request Mar 4, 2015

[SPARK-2087] [SQL] Multiple thriftserver sessions with single HiveContext instance #4885

Closed

chenghao-intel closed this Mar 17, 2015

chenghao-intel deleted the multisessions branch July 2, 2015 08:38

[SPARK-2087] [SQL] Multiple thriftserver sessions with different HiveContext instances #4382

[SPARK-2087] [SQL] Multiple thriftserver sessions with different HiveContext instances #4382

Uh oh!

Conversation

chenghao-intel commented Feb 5, 2015

Uh oh!

SparkQA commented Feb 5, 2015

Uh oh!

chenghao-intel commented Feb 5, 2015

Uh oh!

guowei2 commented Feb 5, 2015

Uh oh!

chenghao-intel commented Feb 5, 2015

Uh oh!

SparkQA commented Feb 5, 2015

Uh oh!

AmplabJenkins commented Feb 5, 2015

Uh oh!

chenghao-intel commented Feb 5, 2015

Uh oh!

mallman commented Feb 5, 2015

Uh oh!

chenghao-intel commented Feb 6, 2015

Uh oh!

chenghao-intel commented Feb 6, 2015

Uh oh!

liancheng commented Feb 10, 2015

Uh oh!

chenghao-intel commented Feb 10, 2015

Uh oh!

chenghao-intel commented Feb 10, 2015

Uh oh!

liancheng commented Feb 10, 2015

Uh oh!

liancheng commented Feb 10, 2015

Uh oh!

SparkQA commented Feb 10, 2015

Uh oh!

SparkQA commented Feb 10, 2015

Uh oh!

AmplabJenkins commented Feb 10, 2015

Uh oh!

chenghao-intel commented Feb 10, 2015

Uh oh!

liancheng commented Feb 10, 2015

Uh oh!

SparkQA commented Feb 10, 2015

Uh oh!

SparkQA commented Feb 10, 2015

Uh oh!

AmplabJenkins commented Feb 10, 2015

Uh oh!

liancheng commented Feb 10, 2015

Uh oh!

chenghao-intel commented Feb 11, 2015

Uh oh!

guowei2 commented Feb 12, 2015

Uh oh!

SparkQA commented Feb 12, 2015

Uh oh!

chenghao-intel commented Feb 12, 2015

Uh oh!

SparkQA commented Feb 12, 2015

Uh oh!

AmplabJenkins commented Feb 12, 2015

Uh oh!

chenghao-intel commented Feb 12, 2015

Uh oh!

chenghao-intel commented Feb 12, 2015

Uh oh!

SparkQA commented Feb 12, 2015

Uh oh!

SparkQA commented Feb 12, 2015

Uh oh!

AmplabJenkins commented Feb 12, 2015

Uh oh!

SparkQA commented Feb 15, 2015

Uh oh!

SparkQA commented Feb 15, 2015

Uh oh!