Skip to content

Conversation

@chenghao-intel
Copy link
Contributor

Within thriftserver mode, only a single HiveContext instance for the process, and it leads to sqlconfobject shared across multiple user sessions.
In order to isolate the sqlconf for each user session, we create new HiveContext instance for each of the user session. However, we also want keep the existed logic like multiple users will share the same catalog and cache, hence I pull out the catalog, cachemanager as well as functionRegistry as global unique instance.

  • Cross HiveContext catalog, cache management, function, view etc. (Covered by this PR)
  • Context wide scope for temporal table / function / view, multiple users will not share with each other
  • Cross processes catalog, cache management etc. (manage the cached data in external cache layer, e.g. based on Tachyon)

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26816 has started for PR 4382 at commit 403d6ec.

  • This patch merges cleanly.

@chenghao-intel
Copy link
Contributor Author

@liancheng Do you have any idea how to collect the thrift sever logs in the unit test? It says timeout exception, and I believe either port error or the server process exited due to some error (like the create the hive metastore client failure).

@guowei2
Copy link
Contributor

guowei2 commented Feb 5, 2015

@chenghao-intel
I think there is no need to make HiveContext as ThreadLocal. and SessionState ThreadLocal is enough.
https://github.com/guowei2/spark/compare/SPARK-4815?expand=1
do we fix the same issue?

@chenghao-intel
Copy link
Contributor Author

@guowei2 Probably no. HiveContext has it's own internal metastore (temp metastore), SQLConf instance etc. I don't think multiple users want to share those info when they connect the same thriftserver. SessionState actually wraps the internal state as the local thread internally, hence I didn't change that in my PR.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26816 has finished for PR 4382 at commit 403d6ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26816/
Test PASSed.

@chenghao-intel
Copy link
Contributor Author

Seems HiveThriftServer2Suite didn't run, is it disabled by default?

@mallman
Copy link
Contributor

mallman commented Feb 5, 2015

FWIW I'd like to add my two cents. The main piece of functionality the installation at my company would benefit from is independent user sessions. I'm not familiar enough with the source to say exactly what that means in terms of a source patch, but one of the key use cases is the ability to set the session default database ("use ") and SQLConf settings independent of other beeline connections. Right now, setting the database sets it across all connections and that is a major impediment to wider use of a shared thriftserver.

Cheers!

@chenghao-intel
Copy link
Contributor Author

@mallman this PR exactly aims to fix the bug you mentioned, and it passed the tested in my local machine. However, I am still figuring out some of the unit testing failures, hopefully I can update the title by removing the "WIP" soon.

@chenghao-intel
Copy link
Contributor Author

@liancheng Seems HiveThriftServer2Suite didn't run, is it disabled by default?

@chenghao-intel chenghao-intel changed the title [SPARK-2087] [SQL] [WIP] Multiple thriftserver sessions with different HiveContext instances [SPARK-2087] [SQL] Multiple thriftserver sessions with different HiveContext instances Feb 7, 2015
@liancheng
Copy link
Contributor

@chenghao-intel This is just fixed by #4486.

@chenghao-intel
Copy link
Contributor Author

Should I retest this?

@chenghao-intel
Copy link
Contributor Author

test this please.

@liancheng
Copy link
Contributor

HiveThriftServer2Suite timeout is also fixed, please refer to #4484.

@liancheng
Copy link
Contributor

Yeah, you should retest.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27143 has started for PR 4382 at commit 403d6ec.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27143 has finished for PR 4382 at commit 403d6ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27143/
Test PASSed.

@chenghao-intel
Copy link
Contributor Author

@liancheng Seems the HiveThriftServer2Suite still not be triggered.
test this please.

@liancheng
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27158 has started for PR 4382 at commit 403d6ec.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27158 has finished for PR 4382 at commit 403d6ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27158/
Test PASSed.

@liancheng
Copy link
Contributor

@chenghao-intel The biggest issue I see in this PR is that users can no longer share cached tables, which breaks many existing use scenarios. I'm afraid it may require major efforts to support both multi-session and cache sharing. I can think of two alternatives:

  1. All sessions share a single HiveContext to enable cache sharing, and provide one SQLConf per session to isolate session configurations.
  2. Provide one HiveContext per session to isolate session configuration, but extract cached table meta data management out of SQLContext / HiveContext to some global place, so that different HiveContext can figure out the mapping between cached tables (actually query plans) and their underlying materialized in-memory columnar RDDs).

@chenghao-intel
Copy link
Contributor Author

Yeah, I can understand people want to share the cached table among multi-sessions, is there any potential requirement that people just want to keep the 'temp' table visibility within the HiveContext (will not share with the other session)?
Probably the option 2 is a better if we want to support the feature above in the future.
Anyway, thanks for reviewing, I will update the code.

@guowei2
Copy link
Contributor

guowei2 commented Feb 12, 2015

I agree with liancheng, user want to share cached tables in many cases.
I still think isolate all resources in HiveContext is not a good idea.
Whether to use cached table created by different user is the responsibility the authority does.
we should not always keep them invisibility
I still think SessionState ThreadLocal is enough. and hiveconf can get from SessionState
This is what hive does with thriftServer

@SparkQA
Copy link

SparkQA commented Feb 12, 2015

Test build #27326 has started for PR 4382 at commit 73922ae.

  • This patch merges cleanly.

@chenghao-intel
Copy link
Contributor Author

I agree with @liancheng, too, both code and description are updated.

@SparkQA
Copy link

SparkQA commented Feb 12, 2015

Test build #27326 has finished for PR 4382 at commit 73922ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait DataFrame extends RDDApi[Row] with Serializable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27326/
Test FAILed.

@chenghao-intel
Copy link
Contributor Author

Seems failure due to the irrelevant code.

@chenghao-intel
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 12, 2015

Test build #27328 has started for PR 4382 at commit 73922ae.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 12, 2015

Test build #27328 has finished for PR 4382 at commit 73922ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27328/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27501 has started for PR 4382 at commit 9d3b296.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27501 has finished for PR 4382 at commit 9d3b296.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27501/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 22, 2015

Test build #27830 has started for PR 4382 at commit 197e806.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 22, 2015

Test build #27830 has finished for PR 4382 at commit 197e806.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27830/
Test PASSed.

@chenghao-intel
Copy link
Contributor Author

/cc @liancheng can you review this for me?

@chenghao-intel
Copy link
Contributor Author

/cc @liancheng @marmbrus can you give some high level comments? and then I can start the rebasing (again)

@liancheng
Copy link
Contributor

Hey @chenghao-intel, thanks for working on this, AFAIK this is a pain point for many Spark SQL users who would like to put HiveThriftServer2 into production. Also had a discussion with @marmbrus about this recently.

As we've discussed offline, instead of changing CacheManager and FunctionRegistry to global instances, adding a SQLSession and moving per-session objects (configurations, temporary functions, etc.) to it could be more preferrable. To be more specific:

  1. Add a new SQLSession class, which is responsible to maintain all per-session objects, like configurations, temporary functions, etc..

  2. Add a session field of type SQLSession in SQLContext, and override it in HiveContext, then put Hive specific per-session objects into it, like Hive client, Hive session state, etc..

  3. Add the following session specific methods to SQLContext:

    • createSession: SQLSession
    • currentSession: SQLSession
    • setSession(session: SQLSession): Unit
    • closeSession(session: SQLSession)

    These methods should be private[sql] as they are subject to change. Currently we can just mimic Hive behavior, for example, using thread-local instances just like what Hive session state does. (You may see the SQLSession object within HiveContext a thin wrapper of SessionState together with other per-session components.)

The benefits of this approach are:

  1. In the long run, we'd like to move HiveContext out of the main framework and make Hive a separate data source. With the above approach, it's more natural to build a separate Spark SQL server with multi-user support without depending on Hive specific code.
  2. Making components like CacheManager global objects are not test friendly. Basically it's impossible to make Spark SQL tests run in parallel.

@chenghao-intel
Copy link
Contributor Author

@liancheng thank you very much for the so detailed comment! Actually I am quite fighting with the 2 approaches:
Single HiveContext Instance (with the thread local of SQLSession) V.S. Multiple HiveContext Instances(with global CacheManager and Hive Metastore instance etc.)

The main reasons I choice the later approach is people can create arbitrary number of HiveContext or SQLContext instances when using DF API, and then they may reference the created instances within the same thread, in ThreadLocal(former one) approach, it still will causes unpredictable behavior, unless we make the HiveContext or SQLContext as global unique instance, but this is not we want, right?

Whatever approach we take, it's intuitive that the CacheManager, HiveMetaStore, FunctionRegistry can be Context independent and thread-safe, and we can put the Context dependent information into SQLContext or HiveContext as property, like the sqlconf and sessionState.

What do you think?

@tianyi
Copy link
Contributor

tianyi commented Mar 3, 2015

@chenghao-intel I'm little confusing about why people have to create multiple HiveContext or SQLContext instances in the same thread. Could u provide some cases?

@liancheng
Copy link
Contributor

@chenghao-intel I'm posting the summary of our offline discussion here for future reference:

SQLContext.CacheManager maps to SparkContext.CacheManager in a one-to-one manner. Once the SQL cache manager is made a global object, users can only use a single SparkContext in a single process. This limitation is neither wanted nor necessary. On the other hand, moving session specific stuff to SQLSession and leaving stuff shared by multiple sessions in SQLContext/HiveContext avoids this problem.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28377 has started for PR 4382 at commit 197e806.

  • This patch does not merge cleanly.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28377 has finished for PR 4382 at commit 197e806.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28377/
Test PASSed.

@chenghao-intel
Copy link
Contributor Author

Closing it since #4885 has been merged.

@chenghao-intel chenghao-intel deleted the multisessions branch July 2, 2015 08:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants