Conversation

@xleoken
Member

xleoken commented Mar 20, 2024

What changes were proposed in this pull request?

Add HiveDialect to the sql module.

Why are the changes needed?

In scenarios with multiple Hive JDBC catalogs, Spark throws a ParseException.

SQL

bin/spark-sql \
  --conf "spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data" \
  --conf "spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver" \
  --conf "spark.sql.catalog.bbb=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.bbb.url=jdbc:hive2://172.16.10.13:10000/data" \
  --conf "spark.sql.catalog.bbb.driver=org.apache.hive.jdbc.HiveDriver"

select count(1) from aaa.data.data_part;

Exception

24/03/19 21:58:25 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/f15a5434-6356-455b-aa8e-4ce9903c1b81
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Running query with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO DAGScheduler: Asked to cancel job group a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 ERROR SparkExecuteStatementOperation: Error executing query with a7459d6d-2a5c-4b56-945c-3159e58d12fd, currentState RUNNING, 
org.apache.spark.sql.catalyst.parser.ParseException: 
Syntax error at or near '"data"'(line 1, pos 14)

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)

Does this PR introduce any user-facing change?

no

How was this patch tested?

local test

Was this patch authored or co-authored using generative AI tooling?

no

@dongjoon-hyun
Member

Although I understand your proposal, I'm not sure HiveDialect is a valid name in the Apache Spark community, because the Apache Spark Thrift Server also uses jdbc:hive2. You want to introduce Hive-specific syntax via this HiveDialect rather than via the Spark Thrift Server, right?

@xleoken
Member Author

xleoken commented Mar 21, 2024

Hi @dongjoon-hyun, thanks for your review.
Your considerations are correct, but this patch is applicable to both the Hive Thrift Server and the Spark Thrift Server.

You want to introduce Hive-specific syntax via this HiveDialect rather than via the Spark Thrift Server, right?

Actually, it's not. I used sbin/start-thriftserver.sh in the production environment.

I'm not sure HiveDialect is a valid name in the Apache Spark community

OK, HiveDialect seems a better fit for jdbc:hive2. In the future, if we encounter Hive-specific or Spark SQL-specific syntax issues, we can distinguish between Hive and Spark in the relevant methods.

@dongjoon-hyun
Member

Actually, it's not. I used sbin/start-thriftserver.sh in the production environment.

If you are trying to use this for the Spark Thrift Server, it should be SparkDialect in the Spark community. However, in that case it would look very weird, because Apache Spark would need a dialect to access itself. That's why we don't want to add any SparkDialect or HiveDialect.

@xleoken
Member Author

xleoken commented Mar 21, 2024

Hi @dongjoon-hyun, we want to query data from two independent data centers, so we use multiple Spark JDBC catalogs.


@dongjoon-hyun
Member

cc @yaooqinn, too

@yaooqinn
Member

Thanks for pinging me, @dongjoon-hyun.

I know that it's technically feasible, but we have a much more efficient and direct way to access Hive tables. I don't see the necessity of adding this as a built-in dialect.

@xleoken
Member Author

xleoken commented Mar 21, 2024

Thanks for pinging me, @dongjoon-hyun.

I know that it's technically feasible, but we have a much more efficient and direct way to access Hive tables. I don't see the necessity of adding this as a built-in dialect.

Welcome @yaooqinn, can you explain in more detail? The key to this patch is overriding the quoteIdentifier method.
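
For context, the core of such a dialect is small. A minimal sketch in Scala (not the exact patch; placement and visibility are assumed):

import org.apache.spark.sql.jdbc.JdbcDialect

// Sketch: claim jdbc:hive2 URLs and quote identifiers with backticks,
// which the Hive parser accepts, instead of the ANSI double quotes
// that the default dialect emits.
case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase(java.util.Locale.ROOT).startsWith("jdbc:hive2")

  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

With this in place, the generated probe query becomes SELECT * FROM `data`.`data_part` WHERE 1=0, which HiveServer2 can parse.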

@HyukjinKwon
Member

For the record, this was rejected in #19238 and #4015.

@xleoken
Member Author

xleoken commented Mar 21, 2024

For the record, this was rejected in #19238 and #4015.

Hi @HyukjinKwon, I cannot understand why it was rejected.

The following SQL fails without HiveDialect.

bin/spark-sql \
  --conf "spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data" \
  --conf "spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver" \
  --conf "spark.sql.catalog.bbb=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.bbb.url=jdbc:hive2://172.16.10.13:10000/data" \
  --conf "spark.sql.catalog.bbb.driver=org.apache.hive.jdbc.HiveDriver"

select count(1) from aaa.data.data_part;

@HyukjinKwon
Member

HyukjinKwon commented Mar 21, 2024

Not all drivers can be supported as built-ins in Apache Spark. This can be provided as a third-party library.

@xleoken
Member Author

xleoken commented Mar 21, 2024

Not all drivers can be supported as built-ins in Apache Spark. This can be provided as a third-party library.

But as we know, the Spark Thrift Server is based on the Hive Thrift Server, so HiveDialect had better be built into Apache Spark.

By the way, how would it be provided as a third-party library? Putting a single HiveDialect Scala file into a jar seems unfriendly.

#19238 met the same issue.

@xleoken
Member Author

xleoken commented Mar 21, 2024

@HyukjinKwon how about changing JdbcDialect#quoteIdentifier to return s"`$colName`" instead of s""""$colName""""?

Most DBs can't parse a "data"."data_part" statement.

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
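
The suggestion amounts to flipping the default quoting. A sketch of the two variants (the first is the current default in org.apache.spark.sql.jdbc.JdbcDialect; the second is what Hive-compatible endpoints parse):

// Current default: ANSI double quotes.
def quoteIdentifier(colName: String): String = s""""$colName""""

// Proposed: backticks, as Hive (and MySQL) expect.
def quoteIdentifier(colName: String): String = s"`$colName`"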

@beliefer
Contributor

I think Spark already supports visiting Hive directly. That's the mainstream approach.

@xleoken
Member Author

xleoken commented Mar 21, 2024

I think Spark already supports visiting Hive directly. That's the mainstream approach.

Hi @beliefer, we want to query data from two independent data centers, so we use multiple JDBC catalogs.


@beliefer
Contributor

@xleoken I think you can implement the catalog plugin and register two custom Hive JDBC dialects.
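
For illustration, that registration path is only a few lines of user code. A sketch against the JdbcDialects developer API (MyHiveDialect is a hypothetical name):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical user-side dialect: backtick quoting for jdbc:hive2 URLs.
object MyHiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register once, before any of the JDBC catalogs run a query.
JdbcDialects.registerDialect(MyHiveDialect)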

@xleoken
Member Author

xleoken commented Mar 21, 2024

@xleoken I think you can implement the catalog plugin and register two custom Hive JDBC dialects.

This is too heavy for users, and there's no need for it.

As Daniel Fernandez said in https://issues.apache.org/jira/browse/SPARK-22016, only two functions need to be overridden.

https://issues.apache.org/jira/browse/SPARK-21063
https://issues.apache.org/jira/browse/SPARK-22016
https://issues.apache.org/jira/browse/SPARK-31457

@xleoken
Member Author

xleoken commented Mar 21, 2024

cc @MrDLontheway @danielfx90, can you share your ideas? Thanks.

@yaooqinn
Member

Just FYI, SPARK-47496 makes loading a custom dialect much easier.

@dongjoon-hyun
Member

Thank you, @xleoken and all.

Let me close this PR first to prevent accidental merging. We can continue the discussion on this PR (or the old PRs, #19238 and #4015).

pan3793 added a commit to apache/kyuubi that referenced this pull request Apr 23, 2025
[DOCS] Improve docs for kyuubi-extension-spark-jdbc-dialect

### Why are the changes needed?

This PR removes the page https://kyuubi.readthedocs.io/en/v1.10.1/client/python/pyspark.html and merges most of its content into https://kyuubi.readthedocs.io/en/v1.10.1/extensions/engines/spark/jdbc-dialect.html; some of the latter's original content is also modified.

The current docs are misleading: users have asked me several times why accessing data stored in the Hive warehouse is so slow when they follow the [Kyuubi PySpark docs](https://kyuubi.readthedocs.io/en/v1.10.1/client/python/pyspark.html).

Actually, accessing HiveServer2/STS via the Spark JDBC data source is discouraged by the Spark community, see [SPARK-47482](apache/spark#45609), even though it's technically feasible.

### How was this patch tested?

It's a docs-only change; review is required.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #7036 from pan3793/jdbc-ds-docs.

Closes #7036

c00ce07 [Cheng Pan] style
f2676bd [Cheng Pan] [DOCS] Improve docs for kyuubi-extension-spark-jdbc-dialect

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
pan3793 added a commit to apache/kyuubi that referenced this pull request Apr 23, 2025

(cherry picked from commit 6da0e62; same commit message as above)
Signed-off-by: Cheng Pan <[email protected]>
turboFei pushed a commit to turboFei/kyuubi that referenced this pull request Aug 27, 2025

(same commit message as above; closes apache#7036)