Conversation

@xleoken
Member

xleoken commented Mar 20, 2024

What changes were proposed in this pull request?

Add HiveDialect to the sql module.

Why are the changes needed?

In scenarios with multiple Hive JDBC catalogs, Spark throws a ParseException.

SQL

bin/spark-sql \
  --conf "spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data" \
  --conf "spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver" \
  --conf "spark.sql.catalog.bbb=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.bbb.url=jdbc:hive2://172.16.10.13:10000/data" \
  --conf "spark.sql.catalog.bbb.driver=org.apache.hive.jdbc.HiveDriver"

select count(1) from aaa.data.data_part;

Exception

24/03/19 21:58:25 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/f15a5434-6356-455b-aa8e-4ce9903c1b81
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Running query with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO DAGScheduler: Asked to cancel job group a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 ERROR SparkExecuteStatementOperation: Error executing query with a7459d6d-2a5c-4b56-945c-3159e58d12fd, currentState RUNNING, 
org.apache.spark.sql.catalyst.parser.ParseException: 
Syntax error at or near '"data"'(line 1, pos 14)

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)

Does this PR introduce any user-facing change?

no

How was this patch tested?

local test

Was this patch authored or co-authored using generative AI tooling?

no

@dongjoon-hyun
Member

Although I understand your proposal, I'm not sure HiveDialect is a valid name in the Apache Spark community, because the Apache Spark Thrift Server also uses jdbc:hive2. You want to introduce Hive-specific syntax via this HiveDialect rather than via the Spark Thrift Server, right?

@xleoken
Member Author

xleoken commented Mar 21, 2024

Hi @dongjoon-hyun, thanks for your review.
Your considerations are correct, but this patch is applicable to both the Hive Thrift Server and the Spark Thrift Server.

You want to introduce Hive-specific syntax via this HiveDialect rather than via the Spark Thrift Server, right?

Actually, it's not. I used sbin/start-thriftserver.sh in the production environment.

I'm not sure HiveDialect is a valid name in the Apache Spark community

OK, HiveDialect seems a better fit for jdbc:hive2. In the future, if we encounter Hive-specific or Spark SQL-specific syntax issues, we can distinguish between Hive and Spark in the relevant methods.

@dongjoon-hyun
Member

Actually, it's not. I used sbin/start-thriftserver.sh in the production environment.

If you are trying to use this for the Spark Thrift Server, it should be SparkDialect in the Spark community. However, in that case it would look very weird, because Apache Spark would need a dialect to access itself. That's why we don't want to add any SparkDialect or HiveDialect.

@xleoken
Member Author

xleoken commented Mar 21, 2024

Hi @dongjoon-hyun, we want to query data from two independent data centers, so we use multiple Spark JDBC catalogs.


@dongjoon-hyun
Member

cc @yaooqinn, too

@yaooqinn
Member

Thanks for pinging me, @dongjoon-hyun.

I know that it's technically feasible, but we have a much more efficient and direct way to access Hive tables. I don't see the necessity of adding this as a built-in dialect.

@xleoken
Member Author

xleoken commented Mar 21, 2024

Thanks for pinging me, @dongjoon-hyun.

I know that it's technically feasible, but we have a much more efficient and direct way to access Hive tables. I don't see the necessity of adding this as a built-in dialect.

Welcome @yaooqinn, can you explain in more detail? The key to this patch is overriding the quoteIdentifier method.
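
For context, the core of such a dialect is small. A minimal sketch in Scala (not the exact patch; placement and visibility are assumed):

import org.apache.spark.sql.jdbc.JdbcDialect

// Sketch: claim jdbc:hive2 URLs and quote identifiers with backticks,
// which the Hive parser accepts, instead of the ANSI double quotes
// that the default dialect emits.
case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase(java.util.Locale.ROOT).startsWith("jdbc:hive2")

  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

With this in place, the generated probe query becomes SELECT * FROM `data`.`data_part` WHERE 1=0, which HiveServer2 can parse.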

@HyukjinKwon
Member

For the record, this was rejected in #19238 and #4015.

@xleoken
Member Author

xleoken commented Mar 21, 2024

For the record, this was rejected in #19238 and #4015.

Hi @HyukjinKwon, I cannot understand why it was rejected.

The following SQL fails without HiveDialect.

bin/spark-sql \
  --conf "spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data" \
  --conf "spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver" \
  --conf "spark.sql.catalog.bbb=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.bbb.url=jdbc:hive2://172.16.10.13:10000/data" \
  --conf "spark.sql.catalog.bbb.driver=org.apache.hive.jdbc.HiveDriver"

select count(1) from aaa.data.data_part;

@HyukjinKwon
Member

HyukjinKwon commented Mar 21, 2024

Not all drivers can be supported as built-ins in Apache Spark. This can be provided as a third-party library.

@xleoken
Member Author

xleoken commented Mar 21, 2024

Not all drivers can be supported as built-ins in Apache Spark. This can be provided as a third-party library.

But as we know, the Spark Thrift Server is based on the Hive Thrift Server, so HiveDialect had better be built into Apache Spark.

By the way, how would it be provided as a third-party library? Putting a single HiveDialect Scala file into a jar seems unfriendly.

#19238 met the same issue.

@xleoken
Member Author

xleoken commented Mar 21, 2024

@HyukjinKwon how about changing JdbcDialect#quoteIdentifier to return s"`$colName`" instead of s""""$colName""""?

Most DBs can't parse a "data"."data_part" statement.

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
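
The suggestion amounts to flipping the default quoting. A sketch of the two variants (the first is the current default in org.apache.spark.sql.jdbc.JdbcDialect; the second is what Hive-compatible endpoints parse):

// Current default: ANSI double quotes.
def quoteIdentifier(colName: String): String = s""""$colName""""

// Proposed: backticks, as Hive (and MySQL) expect.
def quoteIdentifier(colName: String): String = s"`$colName`"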

@beliefer
Contributor

I think Spark already supports visiting Hive directly. That's the mainstream approach.

@xleoken
Member Author

xleoken commented Mar 21, 2024

I think Spark already supports visiting Hive directly. That's the mainstream approach.

Hi @beliefer, we want to query data from two independent data centers, so we use multiple JDBC catalogs.


@beliefer
Contributor

@xleoken I think you can implement the catalog plugin and register two custom Hive JDBC dialects.
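
For illustration, that registration path is only a few lines of user code. A sketch against the JdbcDialects developer API (MyHiveDialect is a hypothetical name):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical user-side dialect: backtick quoting for jdbc:hive2 URLs.
object MyHiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register once, before any of the JDBC catalogs run a query.
JdbcDialects.registerDialect(MyHiveDialect)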

@xleoken
Member Author

xleoken commented Mar 21, 2024

@xleoken I think you can implement the catalog plugin and register two custom Hive JDBC dialects.

This is too heavy for users, and there's no need for it.

As Daniel Fernandez said in https://issues.apache.org/jira/browse/SPARK-22016, only two functions need to be overridden.

https://issues.apache.org/jira/browse/SPARK-21063
https://issues.apache.org/jira/browse/SPARK-22016
https://issues.apache.org/jira/browse/SPARK-31457

@xleoken
Member Author

xleoken commented Mar 21, 2024

cc @MrDLontheway @danielfx90, can you share your ideas? Thanks.

@yaooqinn
Member

Just FYI, SPARK-47496 makes loading a custom dialect much easier.

@dongjoon-hyun
Member

Thank you, @xleoken and all.

Let me close this PR first to prevent accidental merging. We can continue the discussion on this PR (or the old PRs, #19238 and #4015).

pan3793 added a commit to apache/kyuubi that referenced this pull request Apr 23, 2025
[DOCS] Improve docs for kyuubi-extension-spark-jdbc-dialect

### Why are the changes needed?

This PR removes the page https://kyuubi.readthedocs.io/en/v1.10.1/client/python/pyspark.html and merges most of its content into https://kyuubi.readthedocs.io/en/v1.10.1/extensions/engines/spark/jdbc-dialect.html; some of the latter's original content is also modified.

The current docs are misleading: users have asked me several times why accessing data stored in the Hive warehouse is so slow when they follow the [Kyuubi PySpark docs](https://kyuubi.readthedocs.io/en/v1.10.1/client/python/pyspark.html).

Actually, accessing HiveServer2/STS via the Spark JDBC data source is discouraged by the Spark community, see [SPARK-47482](apache/spark#45609), even though it's technically feasible.

### How was this patch tested?

It's a docs-only change; review is required.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #7036 from pan3793/jdbc-ds-docs.

Closes #7036

c00ce07 [Cheng Pan] style
f2676bd [Cheng Pan] [DOCS] Improve docs for kyuubi-extension-spark-jdbc-dialect

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
pan3793 added a commit to apache/kyuubi that referenced this pull request Apr 23, 2025

(cherry picked from commit 6da0e62; same commit message as above)
Signed-off-by: Cheng Pan <[email protected]>
turboFei pushed a commit to turboFei/kyuubi that referenced this pull request Aug 27, 2025

(same commit message as above; closes apache#7036)