
Conversation

@allisonwang-db (Contributor)

What changes were proposed in this pull request?

This PR supports `CREATE TABLE ... USING source` for DSv2 sources.

Why are the changes needed?

To support creating DSv2 tables in SQL. Currently, creating the table succeeds, but selecting from a DSv2 table created in SQL fails with this error:

```
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.connector.SimpleDataSourceV2 is not a valid Spark SQL Data Source.
```
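A hedged repro sketch of the described failure, assuming a `spark` session (e.g. in spark-shell) and using the provider class named in the error; the exact SQL is illustrative, not taken from the PR's tests:

```scala
// Before this PR: the CREATE TABLE itself succeeded, but reading the
// table back failed during analysis with the error above.
val provider = "org.apache.spark.sql.connector.SimpleDataSourceV2"
spark.sql(s"CREATE TABLE t USING $provider")
spark.sql("SELECT * FROM t").show() // threw AnalysisException before this PR
```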

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label on Nov 22, 2023
@allisonwang-db force-pushed the spark-46043-dsv2-create-table branch from 03bf5a4 to b77c505 on November 22, 2023 03:47
@allisonwang-db (Contributor Author):

cc @cloud-fan

Contributor:

how does this work? empty table schema?

@allisonwang-db force-pushed the spark-46043-dsv2-create-table branch from c2bbb4a to 81fff6a on November 28, 2023 07:29
@github-actions bot added the DOCS label on Nov 28, 2023
Contributor:

```scala
case p @ Some(v) if !v.isInstanceOf[FileDataSourceV2] => p
case _ => None
```

Contributor:

Because `isV2Provider` is only used by `ResolveSessionCatalog`, move this back into `ResolveSessionCatalog`.

Contributor Author:

Good catch!

Comment on lines 169 to 172
Contributor Author:

We can pass in the `catalogManager` and skip this check. Also, the user cannot provide a schema here.

Contributor Author:

@cloud-fan Actually, no. The `CatalogManager` constructor takes a `V2SessionCatalog`, so here we can't pass the catalog manager into the constructor of the v2 session catalog (circular dependency):

```scala
protected lazy val v2SessionCatalog = new V2SessionCatalog(catalog)
protected lazy val catalogManager = new CatalogManager(v2SessionCatalog, catalog)
```

@allisonwang-db force-pushed the spark-46043-dsv2-create-table branch from a3d687e to 6fdbe6c on November 28, 2023 21:24
@beliefer (Contributor) left a comment:

Looks good!

Contributor:

Please revert this line.

Comment on lines 83 to 85
Contributor:

Suggested change:

```diff
-val tableProperties = table.properties
-val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
-val properties = tableProperties ++ pathOption
+val properties = table.properties ++
+  table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
```

@allisonwang-db force-pushed the spark-46043-dsv2-create-table branch from 51b6581 to 572fea0 on November 29, 2023 23:07
@beliefer (Contributor) left a comment:

```json
},
"CANNOT_CREATE_DATA_SOURCE_V2_TABLE" : {
  "message" : [
    "Failed to create data source V2 table:"
```
Contributor:

shall we include the table name?
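For illustration, one way the entry could carry the table name, following the `<param>` placeholder convention used by other entries in error-classes.json (the exact wording here is hypothetical):

```json
"CANNOT_CREATE_DATA_SOURCE_V2_TABLE" : {
  "message" : [
    "Failed to create data source V2 table <tableName>."
  ],
  "sqlState" : "42846"
}
```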

```json
  ],
  "sqlState" : "42846"
},
"CANNOT_CREATE_DATA_SOURCE_V2_TABLE" : {
```
Contributor:

I can't find other errors that mention data source v2. I think it's a developer-facing concept and we should not expose it to end users via error messages. How about just `CANNOT_CREATE_TABLE`?

```scala
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
(catalog, identifier) match {
  case (Some(cat), Some(ident)) => s"${quoteIdentifier(cat.name())}.${ident.quoted}"
  case (None, Some(ident)) => ident.quoted
```
Contributor:

I don't think this can happen. We can add an assert.

```scala
}

case _ =>
  (schema, partitions)
```
Contributor:

shall we fail here if it's not a valid data source?

Contributor:

Maybe we can do it later. The current behavior allows any table provider.

```diff
 import org.scalatest.BeforeAndAfter

-import org.apache.spark.SparkException
+import org.apache.spark.{SparkException, SparkUnsupportedOperationException}
```
Contributor:

unnecessary change?

@allisonwang-db (Contributor Author):

The test failure seems unrelated.


```scala
case Some(tableProvider) =>
  assert(tableProvider.supportsExternalMetadata())
  lazy val dsOptions = new CaseInsensitiveStringMap(properties)
```
Contributor:

do we need to put the path option?

Contributor:

I think we can add a new method that creates DSv2 options from a `CatalogTable`, to avoid duplicated code.
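A minimal sketch of what such a helper could look like; the name `toV2Options` and its placement are hypothetical, not the method the PR actually added:

```scala
import scala.jdk.CollectionConverters._

import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogUtils}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical helper: build DSv2 options from a CatalogTable by merging the
// table properties with the storage location, exposed as the "path" option.
def toV2Options(table: CatalogTable): CaseInsensitiveStringMap = {
  val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
  new CaseInsensitiveStringMap((table.properties ++ pathOption).asJava)
}
```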

```scala
}

/** Used as a V2 DataSource for V2SessionCatalog DDL */
class FakeV2Provider extends SimpleTableProvider {
```
Contributor:

Can we avoid extending `SimpleTableProvider` here? I think it's not meant to support external metadata.
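For context, a minimal sketch of a test provider that implements `TableProvider` directly and opts into external metadata; the class name and schema are made up, and the `getTable` body is elided:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Unlike SimpleTableProvider, which only implements getTable(options),
// this provider accepts a user-specified schema and partitioning.
class ExternalMetadataProvider extends TableProvider {
  override def supportsExternalMetadata(): Boolean = true

  // Fallback schema, used when the user does not supply one.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("i", "int").add("j", "int")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = {
    ??? // return a Table implementation backed by `schema`
  }
}
```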

```scala
if (schema.nonEmpty) {
  throw new SparkUnsupportedOperationException(
    errorClass = "CANNOT_CREATE_DATA_SOURCE_TABLE.EXTERNAL_METADATA_UNSUPPORTED",
    messageParameters = Map("tableName" -> ident.quoted, "provider" -> provider))
```
Contributor:

`ident.quoted` only quotes when necessary, but in error messages we require fully quoted identifiers.

You can call `toSQLId(ident.asMultipartIdentifier)`, but maybe it's better to add a `def fullyQuoted` to `implicit class IdentifierHelper` and use it here.
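A rough sketch of the suggested helper, written standalone for illustration (in Spark it would live in `CatalogV2Implicits`); it assumes the existing `quoteIdentifier` util, which backtick-quotes unconditionally:

```scala
import org.apache.spark.sql.catalyst.util.quoteIdentifier
import org.apache.spark.sql.connector.catalog.Identifier

object IdentifierImplicits {
  implicit class IdentifierHelper(ident: Identifier) {
    // Quote every name part, unlike `quoted`, which only adds backticks
    // when a part needs them.
    def fullyQuoted: String =
      (ident.namespace() :+ ident.name()).map(quoteIdentifier).mkString(".")
  }
}
```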


test("SPARK-46043: create table in SQL with schema required data source") {
val cls = classOf[SchemaRequiredDataSource]
val e = intercept[IllegalArgumentException] {
Contributor:

oh, it doesn't have an error class?



test("SPARK-46043: create table in SQL with partitioning required data source") {
val cls = classOf[PartitionsRequiredDataSource]
val e = intercept[IllegalArgumentException](
Contributor:

ditto

Contributor:

oh, it's thrown directly from the data source?

```diff
 verifyTable(t1, Seq.empty[(Long, String, String)].toDF("id", "data", "missing"))
-val tableName = if (catalogAndNamespace.isEmpty) toSQLId(s"default.$t1") else toSQLId(t1)
+val tableName = if (catalogAndNamespace.isEmpty) {
+  toSQLId(s"spark_catalog.default.$t1")
```
Contributor:

Not related to your PR, but this seems to indicate a bug: so the error message points to table "`spark_catalog.default.t1`"? cc @MaxGekk

```scala
case (Some(cat), Some(ident)) => s"${quoteIfNeeded(cat.name())}.${ident.quoted}"
case (None, None) => table.name()
case _ =>
  throw new IllegalArgumentException(
```
Contributor:

This should be `SparkException.internalError`.
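For illustration, a standalone sketch of the suggested change (the helper name and message are made up): `SparkException.internalError` attaches the `INTERNAL_ERROR` error class, signaling a Spark bug rather than bad user input.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}

// Hypothetical version of the naming logic: the impossible
// catalog/identifier combination is reported as an internal error.
def relationName(
    catalog: Option[CatalogPlugin],
    identifier: Option[Identifier],
    fallback: String): String = (catalog, identifier) match {
  case (Some(cat), Some(ident)) => s"${cat.name()}.$ident"
  case (None, None) => fallback
  case _ =>
    throw SparkException.internalError(
      "Expected either both catalog and identifier, or neither.")
}
```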

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in 5fec76d on Dec 5, 2023
dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
### What changes were proposed in this pull request?

This PR supports `CREATE TABLE ... USING source` for DSv2 sources.

### Why are the changes needed?

To support creating DSv2 tables in SQL. Currently, creating the table succeeds, but selecting from a DSv2 table created in SQL fails with this error:
```
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.connector.SimpleDataSourceV2 is not a valid Spark SQL Data Source.
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#43949 from allisonwang-db/spark-46043-dsv2-create-table.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Dec 14, 2023
…in DataSourceV2Relation

### What changes were proposed in this pull request?

#43949 added a check in the `name` method of `DataSourceV2Relation`, which can be overly strict. This PR removes the check and reverts to using `table.name()` when either the catalog or the identifier is empty.

### Why are the changes needed?

To reduce the chance of having breaking changes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44348 from allisonwang-db/spark-46043-followup.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Dec 20, 2023
### What changes were proposed in this pull request?

#43949 supports CREATE TABLE using DSv2 sources. This PR supports CREATE TABLE AS SELECT (CTAS) using DSv2 sources. It turns out that we don't need additional code changes. This PR simply adds more test cases for CTAS queries.

### Why are the changes needed?

To add tests for CTAS for DSv2 sources.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44190 from allisonwang-db/spark-46272-ctas.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
yaooqinn pushed a commit that referenced this pull request Mar 12, 2024
…stom session catalog

### What changes were proposed in this pull request?

This is a follow-up of #43949 to fix a breaking change. Spark allows users to provide a custom session catalog, which may return custom v2 tables based on the table provider. #43949 resolves the table provider earlier than the custom session catalog does, which can break custom session catalogs. This PR fixes the issue by not resolving the table provider when a custom session catalog is present.
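As a rough illustration of the guard described above (hypothetical helper; the actual change lives in the analyzer rules), a custom session catalog can be detected via the optional `spark.sql.catalog.spark_catalog` conf:

```scala
import org.apache.spark.sql.internal.SQLConf

// Sketch: the v2 session catalog is customized iff the user has set
// spark.sql.catalog.spark_catalog (SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).
def hasCustomSessionCatalog(conf: SQLConf): Boolean =
  conf.getConf(SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).isDefined
```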

### Why are the changes needed?

Avoid breaking custom session catalogs.

### Does this PR introduce _any_ user-facing change?

No, #43949 is not released yet.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45440 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Kent Yao <[email protected]>