[SPARK-47482] Add HiveDialect to sql module #45609
Conversation
Although I understand your proposal, I'm not sure HiveDialect is a valid name in the Apache Spark community, because Apache Spark Thrift Server also uses jdbc:hive2. You want to introduce Hive-specific syntax via this HiveDialect rather than via Spark Thrift Server, right?
hi @dongjoon-hyun, thanks for your review.
Your considerations are correct, but this patch is applicable to both Hive Thrift Server and Spark Thrift Server.
> You want to introduce Hive-specific syntax via this HiveDialect rather than via Spark Thrift Server, right?

Actually, it's not. I use sbin/start-thriftserver.sh in the production environment.
> I'm not sure HiveDialect is a valid name in the Apache Spark community

OK, HiveDialect seems better for jdbc:hive2. In the future, if we encounter a Hive-specific or Spark-SQL-specific syntax issue, we can distinguish between Hive and Spark in the relevant methods.
If you are trying to use this for Spark Thrift Server, this should be SparkDialect in the Spark community. However, in that case, it will look very weird because Apache Spark would need a dialect to access itself. That's why we don't want to add any SparkDialect or HiveDialect.
> Actually, it's not. I use sbin/start-thriftserver.sh in the production environment.
Hi @dongjoon-hyun, we want to query data from two independent data centers, so we use multiple Spark JDBC catalogs.
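For context, a multi-catalog setup like the one described here might look roughly like the following (a hypothetical sketch: the catalog names `dc1`/`dc2` and host names are illustrative, not from this PR):

```properties
# Two independent Hive warehouses exposed as Spark JDBC catalogs
spark.sql.catalog.dc1=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
spark.sql.catalog.dc1.url=jdbc:hive2://dc1-host:10000/default
spark.sql.catalog.dc1.driver=org.apache.hive.jdbc.HiveDriver

spark.sql.catalog.dc2=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
spark.sql.catalog.dc2.url=jdbc:hive2://dc2-host:10000/default
spark.sql.catalog.dc2.driver=org.apache.hive.jdbc.HiveDriver
```

With a configuration like this, a query such as `SELECT * FROM dc1.db.tbl` is routed over JDBC to the corresponding data center, which is why the generated SQL must be parseable by Hive.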
cc @yaooqinn, too
Thanks for pinging me @dongjoon-hyun. I know that it's technically feasible, but we have a much more efficient and direct way to access Hive tables. I don't see the necessity of adding it as a built-in dialect.
Welcome @yaooqinn, can you explain in detail? The key to this patch is to override
Hi @HyukjinKwon, I cannot understand why it was rejected. The following SQL will fail without
Not all drivers can be supported as built-ins in Apache Spark. It can be provided as a third-party library.
But as we know, Spark Thrift Server is based on Hive Thrift Server. By the way, how can it be provided as a third-party library? Put only one
@HyukjinKwon, how about this change? Most DBs can't parse
I think Spark already supports accessing Hive directly. It's the mainstream approach.
Hi @beliefer, we want to query data from two independent data centers, so we use multiple JDBC catalogs.
@xleoken I think you can implement the catalog plugin and register two custom Hive JDBC dialects.
This is too heavy for users and there's no need for it. As Daniel Fernandez said, only two functions need to be overridden; see https://issues.apache.org/jira/browse/SPARK-22016 and https://issues.apache.org/jira/browse/SPARK-21063.
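The "only two functions" point can be sketched as follows. This is a hypothetical illustration, not the actual code from this PR; it assumes Spark's `org.apache.spark.sql.jdbc.JdbcDialect` API is on the classpath. The default dialect quotes identifiers with ANSI double quotes, which Hive's parser rejects, so a Hive dialect mainly needs to override `canHandle` and `quoteIdentifier`:

```scala
import org.apache.spark.sql.jdbc.JdbcDialect

// Hypothetical sketch of the dialect this PR proposes.
object HiveDialect extends JdbcDialect {
  // Claim jdbc:hive2 URLs (used by both HiveServer2 and Spark Thrift Server).
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2")

  // Hive quotes identifiers with backticks; the default dialect's
  // double quotes produce SQL that Hive's parser cannot handle.
  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}
```

With this override, the JDBC source generates `` SELECT `id` FROM tbl `` instead of `SELECT "id" FROM tbl`, which is the failure the thread is discussing.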
cc @MrDLontheway @danielfx90, can you share your ideas? Thanks.
Just FYI, SPARK-47496 makes loading a custom dialect much easier.
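Even without SPARK-47496, Spark has long allowed registering a dialect at runtime via `JdbcDialects.registerDialect`. A minimal sketch, assuming a custom dialect shipped in a third-party jar (the name `MyHiveDialect` is hypothetical):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// A custom dialect provided as a third-party library (hypothetical name).
object MyHiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register it before reading through the JDBC data source;
// Spark will then pick this dialect for any jdbc:hive2 URL.
JdbcDialects.registerDialect(MyHiveDialect)
```

This is the path the reviewers suggest instead of a built-in dialect: ship the two overrides as a library and register them in the application.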
[DOCS] Improve docs for kyuubi-extension-spark-jdbc-dialect

### Why are the changes needed?
This PR removes the page https://kyuubi.readthedocs.io/en/v1.10.1/client/python/pyspark.html and merges most of its content into https://kyuubi.readthedocs.io/en/v1.10.1/extensions/engines/spark/jdbc-dialect.html; some original content of the latter is also modified. The current docs are misleading: I have been asked several times by users why, after following the [Kyuubi PySpark docs](https://kyuubi.readthedocs.io/en/v1.10.1/client/python/pyspark.html), accessing data stored in the Hive warehouse is so slow. Actually, accessing HiveServer2/STS from the Spark JDBC data source is discouraged by the Spark community (see [SPARK-47482](apache/spark#45609)), even though it's technically feasible.

### How was this patch tested?
It's a docs-only change; review is required.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #7036 from pan3793/jdbc-ds-docs.

c00ce07 [Cheng Pan] style
f2676bd [Cheng Pan] [DOCS] Improve docs for kyuubi-extension-spark-jdbc-dialect

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>


### What changes were proposed in this pull request?
Add HiveDialect to the sql module.

### Why are the changes needed?
In scenarios with multiple Hive catalogs accessed through the JDBC data source, SQL queries throw a ParseException.
### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Local test.

### Was this patch authored or co-authored using generative AI tooling?
No.