[SPARK-46043][SQL] Support create table using DSv2 sources #43949
Conversation
Force-pushed from 03bf5a4 to b77c505.
cc @cloud-fan
how does this work? empty table schema?
Force-pushed from c2bbb4a to 81fff6a.
```
case p @ Some(v) if !v.isInstanceOf[FileDataSourceV2] => p
case _ => None
```
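For context, a self-contained sketch of the pattern under discussion, assuming the DSv2 TableProvider and FileDataSourceV2 types; the function name is illustrative:

```scala
import org.apache.spark.sql.connector.catalog.TableProvider
import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2

// Keep the resolved provider only when it is a non-file-based DSv2 source;
// file sources continue through the existing paths.
def nonFileV2Provider(provider: Option[TableProvider]): Option[TableProvider] =
  provider match {
    case p @ Some(v) if !v.isInstanceOf[FileDataSourceV2] => p
    case _ => None
  }
```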
Because isV2Provider is only used by ResolveSessionCatalog, move this back to ResolveSessionCatalog.
Good catch!
We can pass in the catalogManager and skip this check. And the user cannot provide a schema here.
@cloud-fan Actually no. The CatalogManager constructor takes in a v2SessionCatalog, so here we can't pass the catalog manager into the constructor of the v2 session catalog (circular dependency):
spark/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala
Lines 174 to 176 in 7a0d041:

```
protected lazy val v2SessionCatalog = new V2SessionCatalog(catalog)
protected lazy val catalogManager = new CatalogManager(v2SessionCatalog, catalog)
```
Force-pushed from a3d687e to 6fdbe6c.
beliefer left a comment:
Looks good!
Please revert this line.
```
-val tableProperties = table.properties
-val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
-val properties = tableProperties ++ pathOption
+val properties = table.properties ++
+  table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
```
Force-pushed from 51b6581 to 572fea0.
beliefer left a comment:
LGTM. cc @cloud-fan @huaxingao
```
},
"CANNOT_CREATE_DATA_SOURCE_V2_TABLE" : {
  "message" : [
    "Failed to create data source V2 table:"
```
shall we include the table name?
```
  ],
  "sqlState" : "42846"
},
"CANNOT_CREATE_DATA_SOURCE_V2_TABLE" : {
```
I can't find other errors that mention data source v2. I think it's a developer thing and we should not expose it to end users via error message. How about just CANNOT_CREATE_TABLE?
```
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
(catalog, identifier) match {
  case (Some(cat), Some(ident)) => s"${quoteIdentifier(cat.name())}.${ident.quoted}"
  case (None, Some(ident)) => ident.quoted
```
I don't think this can happen. We can add an assert.
```
}

case _ =>
  (schema, partitions)
```
shall we fail here if it's not a valid data source?
Maybe we can do it later. The current behavior allows any table provider.
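For reference, a hedged sketch of the stricter alternative discussed above (not what this PR does), assuming DataSource.lookupDataSourceV2 is the lookup in scope; the message text is illustrative:

```scala
import org.apache.spark.sql.execution.datasources.DataSource

// Fail fast when the provider does not resolve to a DSv2 TableProvider,
// instead of silently accepting any table provider string.
val tableProvider = DataSource.lookupDataSourceV2(provider, conf).getOrElse {
  throw new IllegalArgumentException(s"$provider is not a valid data source")
}
```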
```
 import org.scalatest.BeforeAndAfter

-import org.apache.spark.SparkException
+import org.apache.spark.{SparkException, SparkUnsupportedOperationException}
```
unnecessary change?
The test failure seems unrelated.
```
case Some(tableProvider) =>
  assert(tableProvider.supportsExternalMetadata())
  lazy val dsOptions = new CaseInsensitiveStringMap(properties)
```
Do we need to put in the path option?
I think we can add a new method that creates ds options from a CatalogTable, to avoid duplicated code.
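A minimal sketch of such a helper, reusing the properties-plus-path construction quoted earlier in this review; the method name toV2Options is illustrative:

```scala
import scala.jdk.CollectionConverters._

import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogUtils}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Build DSv2 options from a CatalogTable: the table properties plus a
// "path" option derived from the storage location, when one is set.
def toV2Options(table: CatalogTable): CaseInsensitiveStringMap = {
  val properties = table.properties ++
    table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
  new CaseInsensitiveStringMap(properties.asJava)
}
```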
```
}

/** Used as a V2 DataSource for V2SessionCatalog DDL */
class FakeV2Provider extends SimpleTableProvider
```
Can we avoid extending SimpleTableProvider here? I think it's not meant to support external metadata.
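One way to do that, sketched under the assumption that the test only needs a fixed schema (the class name, table name, and schema here are illustrative): implement TableProvider directly and opt in to external metadata.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class FakeV2Provider extends TableProvider {
  // Accept a user-specified schema/partitioning from the catalog.
  override def supportsExternalMetadata(): Boolean = true

  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("i", "int")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = {
    val tableSchema = schema // capture to avoid shadowing inside the anonymous class
    new Table {
      override def name(): String = "fake"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] = util.Collections.emptySet()
    }
  }
}
```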
```
if (schema.nonEmpty) {
  throw new SparkUnsupportedOperationException(
    errorClass = "CANNOT_CREATE_DATA_SOURCE_TABLE.EXTERNAL_METADATA_UNSUPPORTED",
    messageParameters = Map("tableName" -> ident.quoted, "provider" -> provider))
```
ident.quoted only quotes when necessary, but in error messages we require fully quoted identifiers.
You can call toSQLId(ident.asMultipartIdentifier), but maybe it's better to add a def fullyQuoted to the implicit class IdentifierHelper and use it here.
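A sketch of the suggested helper, assuming quoteIdentifier (which always adds backticks, unlike quoteIfNeeded) is in scope; placing it in IdentifierHelper follows the suggestion above:

```scala
import org.apache.spark.sql.catalyst.util.quoteIdentifier
import org.apache.spark.sql.connector.catalog.Identifier

implicit class IdentifierHelper(ident: Identifier) {
  // Quote every name part unconditionally, as error messages require.
  def fullyQuoted: String =
    (ident.namespace :+ ident.name).map(quoteIdentifier).mkString(".")
}
```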
```
test("SPARK-46043: create table in SQL with schema required data source") {
  val cls = classOf[SchemaRequiredDataSource]
  val e = intercept[IllegalArgumentException] {
```
oh, it doesn't have an error class?
| test("SPARK-46043: create table in SQL with partitioning required data source") { | ||
| val cls = classOf[PartitionsRequiredDataSource] | ||
| val e = intercept[IllegalArgumentException]( |
ditto
oh it's thrown directly from the data source?
```
 verifyTable(t1, Seq.empty[(Long, String, String)].toDF("id", "data", "missing"))
-val tableName = if (catalogAndNamespace.isEmpty) toSQLId(s"default.$t1") else toSQLId(t1)
+val tableName = if (catalogAndNamespace.isEmpty) {
+  toSQLId(s"spark_catalog.default.$t1")
```
Not related to your PR, but this seems to indicate a bug. So the error message points to table "`spark_catalog.default.t1`"? cc @MaxGekk
```
case (Some(cat), Some(ident)) => s"${quoteIfNeeded(cat.name())}.${ident.quoted}"
case (None, None) => table.name()
case _ =>
  throw new IllegalArgumentException(
```
this should be SparkException.internalError
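A sketch of the suggested change, mirroring the quoted match (the message text is illustrative):

```scala
import org.apache.spark.SparkException

(catalog, identifier) match {
  case (Some(cat), Some(ident)) => s"${quoteIfNeeded(cat.name())}.${ident.quoted}"
  case (None, None) => table.name()
  case _ =>
    // An unreachable catalog/identifier combination indicates a Spark bug,
    // so raise an internal error rather than an IllegalArgumentException.
    throw SparkException.internalError(
      s"Invalid catalog and identifier combination for table: ${table.name()}")
}
```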
thanks, merging to master!
### What changes were proposed in this pull request?

This PR supports `CREATE TABLE ... USING source` for DSv2 sources.

### Why are the changes needed?

To support creating DSv2 tables in SQL. Currently the table creation can work, but when you select a DSv2 table created in SQL, it fails with this error:

```
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.connector.SimpleDataSourceV2 is not a valid Spark SQL Data Source.
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#43949 from allisonwang-db/spark-46043-dsv2-create-table.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
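A sketch of the user-facing behavior the merged change enables, assuming the test source named in the error above is on the classpath:

```scala
// Create a DSv2 table via SQL, then query it; before this PR the SELECT
// failed with "... is not a valid Spark SQL Data Source."
spark.sql("CREATE TABLE t USING org.apache.spark.sql.connector.SimpleDataSourceV2")
spark.sql("SELECT * FROM t").show()
```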
…in DataSourceV2Relation

### What changes were proposed in this pull request?

#43949 added a check in the `name` method of `DataSourceV2Relation`, which can be overly strict. This PR removes the check and reverts to using `table.name()` when either catalog or identifier is empty.

### Why are the changes needed?

To reduce the chance of having breaking changes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44348 from allisonwang-db/spark-46043-followup.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

#43949 supports CREATE TABLE using DSv2 sources. This PR supports CREATE TABLE AS SELECT (CTAS) using DSv2 sources. It turns out that we don't need additional code changes. This PR simply adds more test cases for CTAS queries.

### Why are the changes needed?

To add tests for CTAS for DSv2 sources.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44190 from allisonwang-db/spark-46272-ctas.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
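A sketch of the CTAS shape these tests cover (the table and provider names are illustrative):

```scala
// CTAS against a DSv2 source: the schema comes from the query,
// so no column list is specified.
spark.sql(
  """CREATE TABLE t2
    |USING org.apache.spark.sql.connector.SimpleDataSourceV2
    |AS SELECT 1 AS i""".stripMargin)
```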
…stom session catalog

### What changes were proposed in this pull request?

This is a follow-up of #43949 to fix a breaking change. Spark allows people to provide a custom session catalog, which may return custom v2 tables based on the table provider. #43949 resolves the table provider earlier than the custom session catalog, which may break custom session catalogs. This PR fixes it by not resolving the table provider if a custom session catalog is present.

### Why are the changes needed?

To avoid breaking custom session catalogs.

### Does this PR introduce _any_ user-facing change?

No, #43949 is not released yet.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45440 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Kent Yao <[email protected]>